Abstract

The entropy maximum approach (Maxent) was developed as a minimization of the subjective uncertainty 
measured by the Boltzmann-Gibbs-Shannon entropy. Many new entropies have been invented in the second 
half of the 20th century. Now there exists a rich choice of entropies for fitting needs. This diversity of 
entropies gave rise to a Maxent "anarchism". The Maxent approach is now the conditional maximization of
an appropriate entropy for the evaluation of the probability distribution when our information is partial and 
incomplete. The rich choice of non-classical entropies causes a new problem: which entropy is better for a 
given class of applications? We understand entropy as a measure of uncertainty which increases in Markov 
processes. In this work, we describe the most general ordering of the distribution space, with respect to 
which all continuous-time Markov processes are monotonic (the Markov order). For inference, this approach
results in a set of conditionally "most random" distributions. Each distribution from this set is a maximizer
of its own entropy. This "uncertainty of uncertainty" is unavoidable in the analysis of non-equilibrium 
systems. Surprisingly, the constructive description of this set of maximizers is possible. Two decomposition 
theorems for Markov processes provide a tool for this description. 
1. Introduction

Entropy was born in the 19th century as a daughter of energy: $dS = \delta Q/T$. Clausius [1], Boltzmann
and Gibbs [2] (and others) had developed the physical notion of entropy. At the same time, the famous
Boltzmann's formula $S = k \log W$ had opened the informational interpretation of entropy. In the 20th
century, Hartley [3] and Shannon [4] introduced a logarithmic measure of information in electronic
communication in order "to eliminate the psychological factors involved and to establish a measure of information
in terms of purely physical quantities" ([3], p. 536). Information theory is focused on entropy as a measure
of uncertainty of subjective choice. This understanding of entropy was returned from information theory
to statistical mechanics by Jaynes [5] as a basis of "subjective" statistical mechanics: "Information theory
provides a constructive criterion for setting up probability distributions on the basis of partial knowledge,
and leads to a type of statistical inference which is called the maximum entropy estimate. It is least
biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing
information. That is to say, when characterizing some unknown events with a statistical model, we should
always choose the one that has Maximum Entropy." This is the brief manifesto of the Maxent (maximum
of entropy) methodology.

Entropy is used for measurement of uncertainty in a probability distribution. The Maxent method finds 
the maximally uncertain distribution under given values of some moments. After Jaynes, this approach
became very popular in physics [6, 7], statistics [8, 9, 10], econometrics [11, 12] and other disciplines.

The non-classical entropies were invented by Rényi [13] in the middle of the 20th century, simultaneously
with the expansion of the Maxent approach. This invention introduced additional uncertainty in the
uncertainty evaluation. Maximization of different entropies produces different probability distributions under
the same conditions. Now, one has to select the proper entropy functional to use in the Maxent approach. 
This choice may be non-obvious. The beautiful and transparent understanding of the Maxent distribution 
as a unique "least biased estimate possible on the given information" is now destroyed by the non-classical 
entropies. If we consider the non-classical entropies seriously then we have to select the proper entropy for 
each problem. 

If we do not find solid reasons for the entropy selection then we have to accept this "Uncertainty of 
Uncertainty" (UoU) as the nature of things. In this case, the set of all the Maxent distributions for different 
entropies will evaluate the unknown "maximally uncertain" distribution under given conditions. We call 
this method of handling the UoU the "maximization of all entropies" or Maxallent. If there are some reasons 
for selection of a class of entropy function then we have to select the conditional maximizer of the entropies 
from this class. 

The widest class of entropies we use in this paper are the Csiszár-Morimoto conditional entropies
($f$-divergencies). They were introduced by Rényi in his famous work [13], where he also proposed the "Rényi
entropy". The $f$-divergencies were studied further by Csiszár [14] and T. Morimoto [15]. For a discrete
probability distribution $P = (p_i)$ and the positive "equilibrium distribution" $P^* = (p_i^*)$, $p_i^* > 0$, the general
form of the $f$-divergence is

$$H_h(P\|P^*) = \sum_i p_i^* \, h\!\left(\frac{p_i}{p_i^*}\right), \tag{1}$$

where $h(x)$ is a convex function defined on the open ($x > 0$) or closed ($x \geq 0$) semi-axis. We use here the
notation $H_h(P\|P^*)$ to stress the dependence of $H_h$ both on $p_i$ and $p_i^*$.

In some practical problems, it is convenient to use a convex function $h(x)$ with singularity at $x = 0$, for
example, $h(x) = -\ln x$ (the Burg relative entropy [16]). Therefore, we assume that the function $H_h(P\|P^*)$
is defined for positive $P$ and $P^*$. Convexity of $h(x)$ implies convexity of $H_h(P\|P^*)$ as a function of $P$. It
achieves its minimal value on the equilibrium probability, $P = P^*$ (under conditions $\sum_i p_i = 1$ and $p_i > 0$).
If $h(x)$ is strictly convex then $H_h(P\|P^*)$ is also strictly convex and this minimizer (the equilibrium) is
unique.
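As a minimal illustration of definition (1), the divergence can be evaluated directly for a few convex functions $h$. The sketch below is not part of the original text: the function names, the two distributions and the choice of the three functions $h$ (each normalized so that $h(1) = 0$) are illustrative assumptions.

```python
import math

def f_divergence(h, p, p_star):
    """H_h(P||P*) = sum_i p*_i * h(p_i / p*_i), for positive p and p*."""
    return sum(ps * h(pi / ps) for pi, ps in zip(p, p_star))

# Three convex choices of h, each with h(1) = 0 (illustrative selection).
def h_kl(x):      # x ln x: relative Boltzmann-Gibbs-Shannon entropy
    return x * math.log(x)

def h_burg(x):    # -ln x: Burg relative entropy
    return -math.log(x)

def h_quad(x):    # (x - 1)^2: quadratic divergence
    return (x - 1.0) ** 2

p_star = [0.5, 0.3, 0.2]   # hypothetical equilibrium distribution
p = [0.2, 0.5, 0.3]        # hypothetical non-equilibrium distribution

# Away from equilibrium every divergence is strictly positive ...
away = [f_divergence(h, p, p_star) for h in (h_kl, h_burg, h_quad)]
# ... and at P = P* each of them attains its minimal value h(1) = 0.
at_eq = [f_divergence(h, p_star, p_star) for h in (h_kl, h_burg, h_quad)]
print(away)
print(at_eq)
```

The three values in `away` differ from each other, which is the starting point of the "uncertainty of uncertainty": different $h$ measure the same deviation from $P^*$ differently.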



1.1. Maxallent, approach #1: parametrization by monotonic function of one variable 

The standard settings for the Maxent approach are: an event space $\Omega$, a divergency $H_h(P\|P^*)$ and a
set of moments $M_r(P)$ ($r = 1, \ldots, k$) are given. Here, $P$ is a probability distribution, $P^*$ is the "maximally
disordered" probability distribution ("equilibrium") and $H_h(P\|P^*)$ measures the deviation of $P$ from $P^*$.
Of course, for general probability spaces we have to assume that $P$ is absolutely continuous with respect
to $P^*$ and that it is possible to compute the divergence $H_h(P\|P^*)$. The Maxent problem is: for given
values of the moments $M_r(P)$ ($r = 1, \ldots, k$) find the minimizers of $H_h(P\|P^*)$. That is, on the set of
probability distributions with given values of $M_r(P)$ ($r = 1, \ldots, k$) find the distributions that are the closest
to the equilibrium $P^*$ if we measure the deviation by $H_h(P\|P^*)$. The terminological mess (Maxent and
minimizers) appears due to historical reasons. Divergences measure the differences between distributions
and we always look for minimizers of them.

To avoid the irrelevant technicalities we consider discrete distributions. Let $\Omega = \{A_1, A_2, \ldots, A_n\}$ be a
finite event space with probability distributions $P = (p_i)$. The set of probability distributions is the standard
simplex $\Delta^{n-1}$ in $\mathbb{R}^n$. The set of positive distributions ($p_i > 0$) is $\Delta_+^{n-1}$, the relative interior of the standard
simplex.

The Maxent problem for $H_h(P\|P^*)$ and given values of moments $\sum_j m_{rj} p_j = M_r$ ($r = 1, \ldots, k$) reads:
find $P \in \Delta_+^{n-1}$ such that

$$H_h(P\|P^*) \to \min, \quad \text{subject to } \sum_j m_{rj} p_j = M_r .$$

The total probability condition gives $\sum_j m_{0j} p_j = 1$ ($m_{0j} = 1$, $M_0 = 1$). Assume that $k + 1 < n$ and

$$\operatorname{rank}(m_{rj}) = k + 1 \quad (r = 0, 1, \ldots, k;\ j = 1, 2, \ldots, n).$$

If $\operatorname{rank}(m_{rj}) < k + 1$ then just exclude some moments.
The method of Lagrange multipliers gives for $P \in \Delta_+^{n-1}$

$$h'\!\left(\frac{p_j}{p_j^*}\right) = \sum_{r=0}^{k} \lambda_r m_{rj} \quad (j = 1, \ldots, n). \tag{2}$$

The derivative $h'$ is a monotonic function. Let $h$ be strictly convex. Then the inverse function $g(y)$ exists,
$g(h'(x)) = x$ (for positive $x$). We can apply the function $g$ to both sides of (2) and write the expression of
$P$ and the equations for the Lagrange multipliers $\lambda_r$ that are just the moment conditions $\sum_j m_{\rho j} p_j = M_\rho$:

$$p_j = p_j^* \, g\!\left(\sum_{r=0}^{k} \lambda_r m_{rj}\right) \ (j = 1, \ldots, n); \qquad
\sum_j m_{\rho j} \, p_j^* \, g\!\left(\sum_{r=0}^{k} \lambda_r m_{rj}\right) = M_\rho \ (\rho = 0, \ldots, k). \tag{3}$$



Therefore, for the class of the strictly convex functions $h$ all the positive solutions of the Maxent problem
for all $f$-divergencies are parameterized (3) by the monotonic function $g$.

The function $g$ should be defined on a real interval $(a, b) = h'((0, \infty))$ (it might be that $a = -\infty$ or
$b = \infty$). The image of $g$ should be the real semi-axis $(0, \infty)$ because $p_j/p_j^*$ may be any positive number.
Therefore, $\lim_{y \to a} g(y) = 0$ and for finite $a$ the function $g$ is defined on $[a, b)$. For each monotonically
increasing function $g$ on a real interval $(a, b)$ with $\operatorname{im} g = (0, \infty)$, the corresponding solution of the Maxent
problem is given by the distribution (3), where $\lambda_r$ are the solutions of the corresponding equation. This
solution of the Maxent problem is the conditional minimizer of $H_h(P\|P^*)$ with $h(x) = \int h'(x)\,dx$, where
$h'(x)$ is the inverse function of $g(y)$, i.e. $h'(x) = y$, where $y$ is the solution to the equation $g(y) = x$. The
additive constant in $\int h'(x)\,dx$ does not affect the solution of any Maxent problem and may be chosen
arbitrarily. Thus, we present the parametric description of the minimizers of all strictly convex divergences
$H_h(P\|P^*)$. A monotonic function $g$ with the range of values $(0, \infty)$ serves as a parameter in this description.
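The parametrization (3) can be tried numerically. The following is a minimal sketch under stated assumptions, not a method taken from the text: it treats a single moment plus the normalization condition, solves the two equations for $(\lambda_0, \lambda_1)$ by a plain Newton iteration, and uses hypothetical data (three states, uniform $P^*$, moment coefficients $0, 1, 2$, prescribed moment value $1.2$). Two functions $g$ are compared: $g = \exp$ (from $h'(x) = \ln x$) and $g(y) = -1/y$ on $y < 0$ (from the Burg choice $h(x) = -\ln x$).

```python
import math

def maxent_from_g(g, dg, m, p_star, M1, lam, steps=50):
    """Solve (3) for normalization (m_0j = 1, M_0 = 1) and one moment (m, M1).

    g, dg : monotone function g and its derivative (assumed to be supplied)
    lam   : starting point (lambda_0, lambda_1) inside the domain of g
    """
    l0, l1 = lam
    for _ in range(steps):
        y = [l0 + l1 * mj for mj in m]
        p = [ps * g(yj) for ps, yj in zip(p_star, y)]
        w = [ps * dg(yj) for ps, yj in zip(p_star, y)]
        # Residuals of the two moment conditions.
        f0 = sum(p) - 1.0
        f1 = sum(mj * pj for mj, pj in zip(m, p)) - M1
        # Symmetric 2x2 Jacobian [[a, b], [b, d]] and the Newton step.
        a = sum(w)
        b = sum(wj * mj for wj, mj in zip(w, m))
        d = sum(wj * mj * mj for wj, mj in zip(w, m))
        det = a * d - b * b
        l0 -= (d * f0 - b * f1) / det
        l1 -= (a * f1 - b * f0) / det
    return [ps * g(l0 + l1 * mj) for ps, mj in zip(p_star, m)]

p_star = [1 / 3, 1 / 3, 1 / 3]   # hypothetical equilibrium
m = [0.0, 1.0, 2.0]              # hypothetical moment coefficients m_1j
M1 = 1.2                         # hypothetical moment value

# h'(x) = ln x, so g = exp: the classical (relative BGS entropy) Maxent.
p_kl = maxent_from_g(math.exp, math.exp, m, p_star, M1, lam=(0.0, 0.0))
# h(x) = -ln x, h'(x) = -1/x, so g(y) = -1/y for y < 0: Burg entropy.
p_burg = maxent_from_g(lambda y: -1.0 / y, lambda y: 1.0 / (y * y),
                       m, p_star, M1, lam=(-1.0, 0.0))
print(p_kl)
print(p_burg)
```

Both results satisfy the same conditions but are different distributions: each choice of $g$ selects its own conditional minimizer. The iteration has no safeguards (damping, domain checks), so it should be read as an illustration for well-conditioned data only.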

For the existence of a positive distribution $P$ which satisfies (3), the moment conditions $\sum_j m_{\rho j} p_j = M_\rho$
($\rho = 0, \ldots, k$) should be compatible with the positivity of $p_i$. Of course, for arbitrary $g$ this may be not
sufficient for the existence of such a positive distribution. To guarantee the existence of a positive Maxent
distribution it is sufficient to add to the function $h(x)$ a term $\varepsilon x \ln x$ with arbitrarily small positive $\varepsilon$. This
term creates a logarithmic singularity of $h'(x)$ at zero. It is easy to check that this singularity guarantees the
existence of a positive solution of (3) if the moment conditions are compatible with the positivity of $p_i$. For
some applied purposes an additional term $-\varepsilon \ln x$ may be even more convenient [17] because it guarantees
the logarithmic singularity of entropy and $h'(x)$ has the singularity $\sim -1/x$ at zero.

In this paper, the question about existence of the positive Maxent distribution is not important. We
need only the conditions (3) which are necessary and sufficient for a positive distribution $P = (p_i)$ to provide
a minimizer of the given $f$-divergency under moment conditions.

1.2. Maxallent, approach #2: the Markov order 

Any Markov process with equilibrium $P^*$ increases disorder. The classical Boltzmann-Gibbs-Shannon
entropy grows in Markov processes. This theorem (the "data processing lemma") was proved in the first
paper of Shannon [4], but of course the entropy growth in kinetics was known before (Boltzmann's $H$-theorem
[2] and its generalization for the systems without detailed balance [18]).

A. Rényi proved in the first paper about the non-classical entropies [13] that all $f$-divergencies (1) decrease
in Markov processes with equilibrium $P^*$. Later on, it was demonstrated that this property characterizes
$f$-divergencies among all functions which can be presented in the form of the sums over states (the "trace
form") [21, 22, 23].



The generalized data processing lemma was proven [24, 25]: for every two positive probability distributions
$P$, $Q$ the divergence $H_h(P\|Q)$ decreases under action of a stochastic matrix $A = (a_{ij})$:

$$H_h(AP\|AQ) \leq \alpha(A) \, H_h(P\|Q),$$

where

$$\alpha(A) = \frac{1}{2} \max_{i,k} \sum_j |a_{ij} - a_{kj}|$$

is the ergodicity contraction coefficient, $0 \leq \alpha(A) \leq 1$.
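A quick numerical experiment can illustrate the qualitative content of the lemma. The sketch below is an illustration under assumptions, not a proof and not the authors' code: it checks only the weaker monotonicity statement $H_h(AP\|AQ) \leq H_h(P\|Q)$ and does not evaluate the coefficient $\alpha(A)$; the matrix is assumed to act on column vectors, so that `a[i][j]` is the probability of the transition $A_j \to A_i$ and every column sums to one. The random data and the three functions $h$ are hypothetical choices.

```python
import math
import random

def f_divergence(h, p, q):
    """H_h(P||Q) = sum_i q_i * h(p_i / q_i) for positive p and q."""
    return sum(qi * h(pi / qi) for pi, qi in zip(p, q))

def apply(a, p):
    """Action of a column-stochastic matrix: (AP)_i = sum_j a[i][j] * p[j]."""
    n = len(p)
    return [sum(a[i][j] * p[j] for j in range(n)) for i in range(n)]

def random_distribution(n, rng):
    """A strictly positive random probability vector of length n."""
    x = [rng.random() + 0.05 for _ in range(n)]
    s = sum(x)
    return [v / s for v in x]

convex_h = [
    lambda x: x * math.log(x),   # relative BGS entropy
    lambda x: -math.log(x),      # Burg relative entropy
    lambda x: (x - 1.0) ** 2,    # quadratic divergence
]

rng = random.Random(0)
n = 4
violations = 0
for _ in range(200):
    # Column j is the distribution of the next state given the state A_j.
    cols = [random_distribution(n, rng) for _ in range(n)]
    a = [[cols[j][i] for j in range(n)] for i in range(n)]
    p, q = random_distribution(n, rng), random_distribution(n, rng)
    for h in convex_h:
        before = f_divergence(h, p, q)
        after = f_divergence(h, apply(a, p), apply(a, q))
        if after > before + 1e-12:
            violations += 1
print(violations)
```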

A second method of handling the UoU is based on a simple remark: "uncertainty of a probability 
distribution should increase in Markov processes". More precisely, let the most uncertain distribution P*
be given (the equilibrium). If a distribution P' can be obtained from a distribution P in a Markov process
with equilibrium P* then we can assume: 

uncertainty of P < uncertainty of P' . 

Thus, we do not care about the values of the uncertainty measure, we just compare the uncertainty of 
distributions: P' is more uncertain than P under given equilibrium P* (in this sense, the values vanish but 
the (pre)order appears [IH). 

In the Maxent approach, the entropy is used as a (pre)order in the distribution space, not as a function, 
and the values are not important because any monotonically increasing transformation of the entropy does 
not change the solution of the Maxent problem. Of course, in some other applications the values of entropy 
are important: in coding theory (bits per symbol) and in thermodynamics (dU = TdS) the values of 
the entropy have a specific important sense. Nevertheless, when we discuss the entropy as a measure 
of uncertainty and work with the huge population of non-classical entropies, these entropies are, in their 
essence, (pre)orders on the space of distributions. 

We consider the continuous time Markov processes with a given equilibrium distribution P*. By definition,
the equilibrium is the unconditionally maximally uncertain distribution. To add the moment conditions 
we define a linear manifold in the space of distributions. For every non-equilibrium distribution P each 
Markov process with the equilibrium distribution P* determines the direction of P evolution, dP/dt. In 
this direction, the distribution becomes more uncertain. Let us take this property as a definition of the 
uncertainty. Instead of an entropy functional we use the transitive closure of this relation, define an order 
on the space of distributions and call it the "Markov order" [23].

Let $Q(P, P^*)$ be a cone of possible time derivatives $dP/dt$ for a given probability distribution $P$, the
equilibrium $P^*$, and all Markov processes with equilibrium $P^*$.

For fixed values of moments, $M_r$, the conditionally linear manifold $L$ in the space of the probability
distributions is given by equations $\sum_j m_{rj} p_j = M_r$ ($r = 0, \ldots, k$). We can consider $P^0 \in L$ as a possibly
extremely disordered distribution on $L$, if for any Markov process with equilibrium $P^*$ the solution $P(t)$ of
the Kolmogorov equation with initial condition $P(0) = P^0$ has no points on the conditionally linear manifold
$L$ for $t > 0$ (we assume that $P^0$ is not a steady state for this process). Instead of this global condition, we
consider the local condition (Fig. 1).

Definition 1. The distribution $P^0 \in L \cap \Delta_+^{n-1}$ is a local minimum of the Markov order on $L \cap \Delta_+^{n-1}$ if

$$(P^0 + Q(P^0, P^*)) \cap L = \{P^0\}. \tag{4}$$



Further, for short, we can omit $\Delta_+^{n-1}$ and call $P^0$ "a local minimum of the Markov order on $L$". In this
definition, we substitute the trajectories $P(t)$ by their tangent directions at point $P^0$, $dP(t)/dt \in Q(P^0, P^*)$.
In Sec. 2 we justify this substitution and prove that the local condition (4) holds if and only if for every
Markov process with equilibrium $P^*$ the solution $P(t)$ of the Kolmogorov equation with initial condition
$P(0) = P^0$ has no points on the condition linear manifold $L$ for $t > 0$ (if $P^0$ is not a steady state for the
process).

Figure 1: The local condition (4): $P^0$ may be an extremely disordered distribution on the condition linear manifold $L$ if the
set $P^0 + Q(P^0, P^*)$ intersects the linear manifold of conditions at the only point $P^0$; (a) $(P^0 + Q(P^0, P^*)) \cap L = \{P^0\}$ and $P^0$
may be an extremely disordered distribution on $L$; (b) $(P^0 + Q(P^0, P^*)) \cap L \neq \{P^0\}$ and $P^0$ has no chance to be an extremely
disordered distribution on $L$. In case (b), there are more disordered distributions on $L$ achievable by the Markov processes
from the initial distribution $P^0$.

For applications, we need only the local minima condition formalized by Definition 1 and the local order
generated by the cone $Q(P^0, P^*)$. The general notion of (global) Markov order appears later, in Section 3,
where we prove equivalence of the Maxima of all entropies and the Markov order approaches. Surprisingly,
the set of the conditional minimizers of all $f$-divergencies and the set of the conditionally minimal elements
of the Markov order coincide for the same conditions (Sec. 3). These sets include all reasonable hypotheses
about conditionally most uncertain distributions. Let us call the problem of description of all the conditional
minima of the Markov order the Maxallent problem.

1.3. Main tool: decomposition theorems 

The main tools for constructive work with the Markov orders are the decomposition theorems for Markov 
chains. The first decomposition theorem states that every Markov chain with a positive equilibrium distri- 
bution is a convex combination of the simple directed cyclic Markov chains with the same equilibrium. The 
coefficients in this decomposition do not depend on the current probability distribution: the vector field 
dP/dt for a general Markov chain is a convex combination of these vector fields for simple cyclic Markov 
chains with the same positive equilibrium. 

The second decomposition theorem states that for every Markov chain with a positive equilibrium
distribution and for any non-equilibrium distribution $P$ the velocity vector $dP/dt$ is a convex combination of the
velocity vectors for the simple cyclic Markov chains of the length two with the same equilibrium (i.e. of the
reversible transitions between two states, $A_i \rightleftharpoons A_j$). The coefficients in this decomposition typically depend
on the current probability distribution.

The idea of the first decomposition theorem was used by Boltzmann in 1882 [18] in his proof of the
$H$-theorem for systems without detailed balance. (This was his answer to the Lorentz objections [19].) He
did not formulate this theorem separately but efficiently used the cycle decomposition for generalization of
detailed balance. Later on, his extension of the detailed balance conditions was analyzed by many authors
under different names as "cyclic balance", "semi-detailed balance" or "complex balance" (see, for example,
the review [I(|). Now, the theory of the cycle decomposition is a well developed area of the theory and 
applications of the random processes.

The second decomposition theorem is less known. We found this theorem in the analysis of the Markov
order [23]. This decomposition means that for the general first-order kinetics and an arbitrary non-equilibrium
probability distribution $P$ there exists a system with detailed balance and the same equilibrium that has the
same velocity $dP/dt$ at point $P$ [27]: the classes of the general Markov processes and the Markov processes
with detailed balance are pointwise equivalent.

The decomposition theorems are discussed in Appendix B in more detail.

2. Local minima of Markov order

Let us consider continuous time Markov chains with $n$ states $A_1, \ldots, A_n$. The Kolmogorov equation (or
master equation) for the probability distribution $P = (p_i)$ is

$$\frac{dp_i}{dt} = \sum_{j,\, j \neq i} (q_{ij} p_j - q_{ji} p_i), \tag{5}$$

where $q_{ij}$ ($i, j = 1, \ldots, n$, $i \neq j$) are non-negative.

In this notation, $q_{ij}$ is the rate constant for the transition $A_j \to A_i$. Any non-negative values of the
coefficients $q_{ij}$ ($i \neq j$) correspond to a master equation. Therefore, the set of all the Kolmogorov equations
(5) may be considered as the positive orthant $\mathbb{R}_+^{n(n-1)}$ in $\mathbb{R}^{n(n-1)}$ with coordinates $q_{ij}$ ($i \neq j$).

Now, let us restrict our consideration to the set of the Markov chains with the given positive equilibrium
distribution $P^*$ ($p_i^* > 0$):

$$\sum_{j,\, j \neq i} q_{ij} p_j^* = \Big(\sum_{j,\, j \neq i} q_{ji}\Big) p_i^* \quad \text{for all } i = 1, \ldots, n. \tag{6}$$

This system of homogeneous linear equations defines a cone of the $q_{ij}$ ($i, j = 1, \ldots, n$, $i \neq j$) in $\mathbb{R}_+^{n(n-1)}$.

Under the balance condition (6), the Kolmogorov equations (5) may be rewritten in a convenient
equivalent form:

$$\frac{dp_i}{dt} = \sum_{j,\, j \neq i} q_{ij} p_j^* \left(\frac{p_j}{p_j^*} - \frac{p_i}{p_i^*}\right). \tag{7}$$
We use below one of the $f$-divergencies (1) with $h(x) = (x - 1)^2$. It is a quadratic divergence, the
weighted $l_2$ distance between $P$ and $P^*$:

$$H_2(P\|P^*) = \sum_i \frac{(p_i - p_i^*)^2}{p_i^*}.$$

With the master equation in the form (7), it is straightforward to calculate the time derivative of
$H_2(P\|P^*)$:

$$\frac{dH_2(P\|P^*)}{dt} = -\sum_{i,j,\, j \neq i} q_{ij} p_j^* \left(\frac{p_i}{p_i^*} - \frac{p_j}{p_j^*}\right)^2 \leq 0. \tag{8}$$

Each term in the sum is non-negative. The time derivative (8) is strictly negative if for a transition $A_j \to A_i$
the rate constant is positive, $q_{ij} > 0$, and $p_i/p_i^* \neq p_j/p_j^*$. Hence, if the state $P$ is not an equilibrium (i.e., the
right hand side in (7) is not zero) then $dH_2(P\|P^*)/dt < 0$.
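The master equation (5), the balance condition (6) and the dissipation formula (8) can be checked on a small example. The following sketch is an illustration under assumptions: the three-state directed cycle $A_1 \to A_2 \to A_3 \to A_1$, its rate constants, the equilibrium and the explicit Euler integration with a fixed step are all hypothetical choices made for this example. The chain has the prescribed equilibrium but does not satisfy detailed balance.

```python
def velocity(q, p):
    """Right-hand side of the master equation (5); q[i][j] is the rate of A_j -> A_i."""
    n = len(p)
    return [sum(q[i][j] * p[j] - q[j][i] * p[i] for j in range(n) if j != i)
            for i in range(n)]

def h2(p, p_star):
    """Quadratic divergence H_2(P||P*)."""
    return sum((pi - ps) ** 2 / ps for pi, ps in zip(p, p_star))

def h2_rate(q, p, p_star):
    """Right-hand side of (8)."""
    n = len(p)
    x = [pi / ps for pi, ps in zip(p, p_star)]
    return -sum(q[i][j] * p_star[j] * (x[i] - x[j]) ** 2
                for i in range(n) for j in range(n) if i != j)

p_star = [0.5, 0.3, 0.2]                 # hypothetical equilibrium
n = len(p_star)
q = [[0.0] * n for _ in range(n)]
for j in range(n):                       # directed cycle with unit equilibrium flux
    q[(j + 1) % n][j] = 1.0 / p_star[j]

# Balance condition (6): the velocity vanishes at P = P*.
balance_defect = max(abs(v) for v in velocity(q, p_star))

# Formula (8) against the direct derivative sum_i 2 (p_i/p*_i - 1) dp_i/dt.
p = [0.2, 0.5, 0.3]
direct = sum(2.0 * (pi / ps - 1.0) * vi
             for pi, ps, vi in zip(p, p_star, velocity(q, p)))
formula = h2_rate(q, p, p_star)

# Explicit Euler steps; dt is small enough that each step is a stochastic map.
dt, history = 0.01, [h2(p, p_star)]
for _ in range(2000):
    p = [pi + dt * vi for pi, vi in zip(p, velocity(q, p))]
    history.append(h2(p, p_star))
print(balance_defect, direct, formula, history[-1])
```

Along the computed trajectory the values in `history` do not increase and the distribution approaches $P^*$, in agreement with (8).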

An important class of the Markov chains is formed by reversible chains with detailed balance. The
detailed balance condition reads:

$$q_{ij} p_j^* = q_{ji} p_i^* \quad \text{for all } i, j = 1, \ldots, n. \tag{9}$$

Under this condition, there are only $n(n-1)/2$ independent coefficients among $n(n-1)$ numbers $q_{ij}$. For
example, we can arbitrarily select $q_{ij} \geq 0$ for $i > j$ and then take $q_{ij} = q_{ji} p_i^*/p_j^*$ for $i < j$. So, for given $P^*$,
the cone of the detailed balance systems (9) is a positive orthant in $\mathbb{R}^{n(n-1)/2}$ embedded in $\mathbb{R}^{n(n-1)}$. The
equilibrium fluxes

$$w_{ij}^* = q_{ij} p_j^* = q_{ji} p_i^* \quad (i > j)$$

are the convenient coordinates in $\mathbb{R}^{n(n-1)/2}$ for a description of the systems with detailed balance.

Let $Q(P, P^*)$ be the set of all possible velocities $dP/dt$ at a non-equilibrium distribution $P$ for all Markov
chains which obey a given positive equilibrium $P^*$. According to the second decomposition theorem, the set
of all possible velocities $dP/dt$ for the chains with detailed balance and the same equilibrium is the same
cone $Q(P, P^*)$. Therefore, $Q(P, P^*)$ is a convex polyhedral cone and its extreme rays consist of the velocity
vectors for two-state Markov chains $A_i \rightleftharpoons A_j$ with rate constants $q_{ji} = \kappa/p_i^*$, $q_{ij} = \kappa/p_j^*$ ($\kappa > 0$).

The construction of the cones of possible velocities was proposed in 1979 [28] for systems with detailed
balance in the general setting, for nonlinear chemical kinetics. These systems are represented by
stoichiometric equations of the elementary reactions coupled with the reverse reactions:

$$\alpha_{\rho 1} A_1 + \ldots + \alpha_{\rho n} A_n \rightleftharpoons \beta_{\rho 1} A_1 + \ldots + \beta_{\rho n} A_n, \tag{10}$$

where $\alpha_{\rho i}, \beta_{\rho i} \geq 0$ are the stoichiometric coefficients, $\rho$ is the reaction number ($\rho = 1, \ldots, m$). The
stoichiometric vector of the $\rho$th reaction is an $n$-dimensional vector $\gamma_\rho$ with coordinates $\gamma_{\rho i} = \beta_{\rho i} - \alpha_{\rho i}$. The
reaction rate is $w_\rho = w_\rho^+ - w_\rho^-$, where $w_\rho^+$ is the rate of the direct elementary reaction and $w_\rho^-$ is the rate
of the reverse reaction.

The equilibria of the $\rho$th pair of reactions (10) form a hypersurface in the space of concentrations. The
intersection of these surfaces for all $\rho$ is the equilibrium (with detailed balance). Each surface of the equilibria
of a pair of elementary reactions (10) divides the non-negative orthant of concentrations into three sets: (i)
$w_\rho > 0$, (ii) $w_\rho = 0$ (the surface of the equilibria) and (iii) $w_\rho < 0$. All the surfaces of equilibria ($w_\rho = 0$)
divide the non-negative orthant of concentrations into compartments. In each compartment, the dominant
direction of each reaction (10) is fixed and, hence, the cone of possible velocities is also constant. It is a
piecewise constant function of concentrations:

$$Q = \operatorname{cone}\{\gamma_\rho \operatorname{sign}(w_\rho) \mid \rho = 1, \ldots, m\},$$

where "cone" stands for the conic hull, that is the set of all linear combinations with non-negative coefficients.
Here and below we use the three-valued sign function (with values ±1 and 0).

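Let us apply this construction to Markov chains with detailed balance. Let us join the transitions
$A_i \rightleftharpoons A_j$ in pairs (say, $i > j$) and introduce the stoichiometric vectors $\gamma^{ji}$ with coordinates:

$$\gamma^{ji}_k = \begin{cases} -1 & \text{if } k = j, \\ 1 & \text{if } k = i, \\ 0 & \text{otherwise.} \end{cases} \tag{11}$$

Let us rewrite the Kolmogorov equation for the Markov process with detailed balance (9) in the quasichemical
form:

$$\frac{dP}{dt} = \sum_{i > j} w_{ij}^* \left(\frac{p_j}{p_j^*} - \frac{p_i}{p_i^*}\right) \gamma^{ji}. \tag{12}$$

Here, $w_{ij}^* = q_{ij} p_j^* = q_{ji} p_i^*$ is the equilibrium flux from $A_i$ to $A_j$ and back.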
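The passage from rate constants to equilibrium fluxes can be verified on an example. The sketch below is illustrative only: the fluxes, the equilibrium and the test distribution are hypothetical, states are indexed from zero, and the helper names are assumptions. It builds the rate constants from the fluxes through detailed balance (9) and compares the velocity given by the master equation (5) with the one given by the quasichemical form (12).

```python
def rates_from_fluxes(w, p_star):
    """Rate constants q[i][j] (transition A_j -> A_i) from fluxes w[(i, j)], i > j."""
    n = len(p_star)
    q = [[0.0] * n for _ in range(n)]
    for (i, j), wij in w.items():
        q[i][j] = wij / p_star[j]   # w*_ij = q_ij p*_j
        q[j][i] = wij / p_star[i]   # w*_ij = q_ji p*_i
    return q

def velocity_master(q, p):
    """Right-hand side of the master equation (5)."""
    n = len(p)
    return [sum(q[i][j] * p[j] - q[j][i] * p[i] for j in range(n) if j != i)
            for i in range(n)]

def velocity_quasichemical(w, p, p_star):
    """Right-hand side of (12): sum over pairs of w*_ij (p_j/p*_j - p_i/p*_i) gamma^{ji}."""
    v = [0.0] * len(p)
    for (i, j), wij in w.items():
        rate = wij * (p[j] / p_star[j] - p[i] / p_star[i])
        v[i] += rate   # gamma^{ji} has +1 at position i
        v[j] -= rate   # and -1 at position j
    return v

p_star = [0.5, 0.3, 0.2]                        # hypothetical equilibrium
w = {(1, 0): 0.7, (2, 0): 0.1, (2, 1): 1.3}     # hypothetical equilibrium fluxes
p = [0.2, 0.5, 0.3]

q = rates_from_fluxes(w, p_star)
v5 = velocity_master(q, p)
v12 = velocity_quasichemical(w, p, p_star)
print(v5)
print(v12)
```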



The cone of possible velocities for (12) is

$$Q(P, P^*) = \operatorname{cone}\left\{ \gamma^{ji} \operatorname{sign}\left(\frac{p_j}{p_j^*} - \frac{p_i}{p_i^*}\right) \,\middle|\, i > j \right\}. \tag{13}$$

The standard simplex of distributions $P$ is divided by linear manifolds $p_i/p_i^* = p_j/p_j^*$ into compartments. They
are the polyhedra where the cone of the local Markov order $Q(P, P^*)$ is constant. The compartments for
the Markov chains with the positive equilibrium $P^*$ correspond to various partial orders on the finite set
$\{p_i/p_i^*\}$ ($i = 1, \ldots, n$).
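The generators in (13) are easy to list for a concrete distribution. This is a minimal sketch with assumed names and data (zero-based state indices, uniform three-state equilibrium, a tolerance `tol` for deciding when two ratios are equal); it only enumerates the generating vectors and does not test membership in the cone.

```python
def cone_generators(p, p_star, tol=1e-12):
    """Non-zero vectors gamma^{ji} * sign(p_j/p*_j - p_i/p*_i), i > j, from (13)."""
    n = len(p)
    x = [pi / ps for pi, ps in zip(p, p_star)]
    gens = []
    for i in range(n):
        for j in range(i):
            d = x[j] - x[i]
            s = (d > tol) - (d < -tol)      # three-valued sign
            if s != 0:
                g = [0] * n
                g[i], g[j] = s, -s          # gamma^{ji}: +1 at i, -1 at j
                gens.append(g)
    return gens

p_star = [1 / 3, 1 / 3, 1 / 3]

inside = cone_generators([0.5, 0.3, 0.2], p_star)   # interior of a 2D compartment
on_line = cone_generators([0.4, 0.4, 0.2], p_star)  # partial equilibrium of two states
at_eq = cone_generators(p_star, p_star)             # the equilibrium itself
print(inside)
print(on_line)
print(at_eq)
```

For the interior point all three vectors are non-zero, on the line of partial equilibrium only two remain, and at equilibrium the cone degenerates to the origin.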

Let us describe the compartments and cones in more detail following [23]. For every natural number
$k \leq n - 1$ the $k$-dimensional compartments are enumerated by surjective functions $\sigma : \{1, 2, \ldots, n\} \to
\{1, 2, \ldots, k + 1\}$. Such a function defines the partial ordering of quantities $p_j/p_j^*$ inside the compartment:

$$\frac{p_i}{p_i^*} > \frac{p_j}{p_j^*} \ \text{ if } \sigma(i) < \sigma(j); \qquad \frac{p_i}{p_i^*} = \frac{p_j}{p_j^*} \ \text{ if } \sigma(i) = \sigma(j). \tag{14}$$
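The enumeration of compartments by surjections can be checked by direct counting. The sketch below is an illustration (the function name is an assumption): for each $k$ it counts by brute force the surjective maps from $\{1, \ldots, n\}$ onto $\{1, \ldots, k+1\}$, with $k = 0$ corresponding to the equilibrium point.

```python
from itertools import product

def compartment_counts(n):
    """Number of k-dimensional compartments for k = 0, ..., n-1,
    counted as surjections {1..n} -> {1..k+1}."""
    counts = {}
    for k in range(n):
        labels = range(1, k + 2)
        counts[k] = sum(1 for sigma in product(labels, repeat=n)
                        if set(sigma) == set(labels))
    return counts

three_states = compartment_counts(3)
four_states = compartment_counts(4)
print(three_states)
print(four_states)
```

For three states this gives one point (the equilibrium), six segments and six triangles, that is, 12 compartments besides the equilibrium.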
Figure 2: The cones of possible velocities $Q$ for all Markov chains with three states and equilibrium equidistribution ($p_i^* = 1/3$).
The triangle of the probability distributions $(p_1, p_2, p_3)$ ($p_i \geq 0$, $p_1 + p_2 + p_3 = 1$) has the vertices $A_i$, where $p_i = 1$ and other
probabilities are zeros. Equilibrium is the centre of the triangle. This triangle is divided by three lines of partial equilibria
($A_i \rightleftharpoons A_j$) into 12 compartments and the equilibrium point. Six compartments are triangles and six other compartments are
segments. For all compartments the cones (here the angles) of possible velocities are shown. Each cone is connected with the
corresponding compartment by a dashed line. In each cone, the vectors $\gamma^{ji} \operatorname{sign}(p_j/p_j^* - p_i/p_i^*)$ are presented. For the 2D (triangle)
compartments all three vectors are non-zero. For the 1D compartments (segments) only two of these vectors are non-zero. The
vectors $\gamma^{ji}$ are presented separately, in the top left corner.

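Let $C_\sigma$ be the corresponding compartment and $Q_\sigma$ be the corresponding local Markov order cone ($Q(P, P^*) =
Q_\sigma$ if $P \in C_\sigma$).

For a given surjection $\sigma$ the compartment $C_\sigma$ and the cone $Q_\sigma$ have the following description:

$$C_\sigma = \left\{ P \,\middle|\, \frac{p_i}{p_i^*} = \frac{p_j}{p_j^*} \text{ for } \sigma(i) = \sigma(j) \text{ and } \frac{p_i}{p_i^*} > \frac{p_j}{p_j^*} \text{ for } \sigma(j) = \sigma(i) + 1 \right\}; \qquad
Q_\sigma = \operatorname{cone}\{\gamma^{ij} \mid \sigma(j) = \sigma(i) + 1\}. \tag{15}$$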



In Fig. 2 the partition of the standard distribution simplex into compartments, and the cones (angles)
of possible velocities are presented for the Markov chains with three states. In the construction of this cone,
reversible chains with detailed balance are used. Due to the second decomposition theorem, this construction
of the cone of possible velocities is valid for the class of general Markov chains (and not only for reversible
chains) with the same equilibrium. It seems quite surprising that the Markov order for general Markov
chains is generated by the reversible Markov chains which satisfy the detailed balance principle.

Let $L$ be a linear manifold in the probability distribution space. Due to Definition 1, $P^0 \in L \cap \Delta_+^{n-1}$ is
a local minimum of the Markov order on $L \cap \Delta_+^{n-1}$ if the condition (4) holds.

Figure 3: a) and b) Extreme points of the Markov order for the Markov chain with three states and different positions of the
condition line $L$. c) Extreme points of the Markov order coincide with the partial equilibria, when the moments are just some
of $p_i$.

In Fig. 3 the sets of conditional minimizers are presented for the Markov order on the straight line $L$ for
the Markov chain with three states and symmetric equilibrium ($p_i^* = 1/3$). Two general positions of $L$ in
the probability triangle are used (Fig. 3a,b). If $L$ is parallel to one side of the triangle (Fig. 3c) then the
moments are just some of the $p_i$ and the extreme points of the Markov order on $L \cap \Delta_+^{n-1}$ coincide with the
partial equilibria.

Let $J$ be a set of pairs of indexes $(i, j)$ ($i > j$) and $\mathcal{K}_J$ be the class of kinetic equations (12) with $w_{ij}^* = 0$
for $(i, j) \notin J$ and $w_{ij}^* > 0$ for $(i, j) \in J$ ($i \neq j$). We define $\Phi_J(P^0)$ for an initial distribution $P^0$ as a set of
all values $P(t)$ ($t \geq 0$) for solutions $P(t)$ of all equations from the class $\mathcal{K}_J$ with initial value $P(0) = P^0$.

Consider a cone of possible velocities for the set of transitions $A_i \rightleftharpoons A_j$, $(i, j) \in J$:

$$Q_J(P, P^*) = \operatorname{cone}\left\{ \gamma^{ji} \operatorname{sign}\left(\frac{p_j}{p_j^*} - \frac{p_i}{p_i^*}\right) \,\middle|\, (i, j) \in J \right\}.$$

The following proposition states that in a vicinity of the distribution $P^0$ the sets $\Phi_J(P^0)$ and $P^0 +
Q_J(P^0, P^*)$ coincide. This gives a justification of the use of the cone of the tangent directions $Q_J(P^0, P^*)$
in the definition of the local minima of the Markov order (4).

Proposition 1. Let $p_i/p_i^* \neq p_j/p_j^*$ ($(i, j) \in J$) for a distribution $P = P^0$. There exists a vicinity $U$ of $P^0$
where $P^0 + Q_J(P^0, P^*)$ coincides with $\Phi_J(P^0)$:

$$(P^0 + Q_J(P^0, P^*)) \cap U = \Phi_J(P^0) \cap U .$$

Proof. There exists a Euclidean ball $B_r$ around $P^0$ where $p_i/p_i^* \neq p_j/p_j^*$ ($(i, j) \in J$). Due to (8), inside $B_r$,
the divergence $H_2(P\|P^*)$ strictly decreases with $\lambda$ increasing along any ray $P^0 + \lambda e$, $e \in Q_J(P^0, P^*)$ ($\lambda \geq 0$).
For each ray, we can find the minimum of $H_2(P\|P^*)$ in $B_r$. Let the maximum of these minima be $h_r(P^0)$:

$$h_r(P^0) = \max_{e \in Q_J(P^0, P^*)} \left\{ \min_{\lambda \geq 0} \left\{ H_2(P^0 + \lambda e\|P^*) \mid P^0 + \lambda e \in B_r \right\} \right\}.$$

By construction, $H_2(P^0\|P^*) > h_r(P^0)$. The set

$$U = \{P \in B_r \mid H_2(P\|P^*) > h_r(P^0)\}$$

is a vicinity of $P^0$. The intersection $(P^0 + Q_J(P^0, P^*)) \cap U$ is

$$(P^0 + Q_J(P^0, P^*)) \cap U = \{P \in (P^0 + Q_J(P^0, P^*)) \mid H_2(P\|P^*) > h_r(P^0)\}.$$

For any system from $\mathcal{K}_J$ in $B_r$ and for any distribution $P \in (P^0 + Q_J(P^0, P^*))$ the velocity vector $dP/dt$
belongs to $Q_J(P^0, P^*)$. Obviously, $(P + Q_J(P^0, P^*)) \subset (P^0 + Q_J(P^0, P^*))$. Therefore, the solution of this
system with the initial condition $P(0) = P^0$ may leave the intersection $(P^0 + Q_J(P^0, P^*)) \cap U$ through
the level surface $H_2(P\|P^*) = h_r(P^0)$ only. After that, the solution cannot return to $U$ because in $U$ the
values of $H_2(P\|P^*)$ are bigger, $H_2(P\|P^*) > h_r(P^0)$, and $H_2(P(t)\|P^*)$ should decrease in time along every
solution of any system from $\mathcal{K}_J$. Thus, one inclusion is proven,

$$(P^0 + Q_J(P^0, P^*)) \cap U \supseteq \Phi_J(P^0) \cap U .$$

To prove the second inclusion, (P° + Qj(P° , P*)) n U C $/(P°) n U, we have to demonstrate that the 
solutions P{t) (P(0) =P°,t> 0) of the equations from JCj cover (P° + Qj(P°, P*)) in some vicinity of P°. 
The polyhedral cone Qj(P°,P*) is covered by the simplicial cones spanned by the sets of linearly 

independent vectors 7 :,I sign( jp- — ) . Therefore, it is sufficient to prove the second inclusion for the simplicial 

cones Qj(P°,P*). 

Let the vectors {7 l,l |(i ) i) G ^} be linearly independent. For the simplicity of notation, let us enumerate 
the states in the order of the values of /p* : 

El > El > > E« 

Pi ~ P2 ~ " ' ~ Pn ' 

In these notations, sign(p- — ^t) = 1 for all (i, j) G J because i > j and ^4- — p- 7^ for G J. 
Consider a subset of the cone Q,/(P°,P*) (a "pyramid") 

Qj(P°,P*) = \ E %^ ' E %<l}- (16) 

The "base" of this pyramid is a simplex 



Bj(P°,P*)={ E 

(t,j')eJ (ij)e./ 



u — 0, ^ ij 



Let a > be sufficiently small and, therefore, U - 4- 7^ ((i, j) G J) in P° + aQj(P°,P*). For 

this a, a solution P(i) (t > 0) of an equation from the class K.j with initial data P(0) = P° may leave 
P° + aQj(P°, P*) only through its base, P° + aSj(P, P*). 

Let us prove that if a is sufficiently small then for each point y G Bj(P, P*) there exists a system in JCj 
whose solution P(t) (t > 0, P(0) = P°) leaves P° + aQj(P°, P*) through the point P° + ay. This means 
that P(ti) = P° + aa; for some ti > and P(t) G Q,/(P, P*) for < t < t x . 

Each vector x G Bj(P, P*) can be expanded into a linear combination of 7 Jl G J): 

x = E ^ ji > ^ and E 6 v = L 

(i,j)GJ (»,j)GJ 

With this expansion we define the system K x G Kj by the condition ^r\ p _ p0 = x: 




(18) 



can 



(Just take w* } = % (U - |£) for (ij) G J in ((T2J.) A solution P(t) (P(0) = P°) of this equation ([I 
be also expanded into a linear combination of 7 Jl ((«', j) G J) (x G Bj(P, P*)): 

P(i) = P° + to + y/(t, x) = P° + E + {^»})) T*> (19) 



'y>0 
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where Vij (t, {9i m }) are analytic functions. If x belongs to a face F of the cone Qj(P°, P*) then P(t) G P°+F 
for sufficiently small t. 

The moment $t=t(a,x)$ when the solution $P(t)$ (19) leaves $P^0+a\mathbf{Q}_J(P^0,P^*)$ is a root of the equation

$$t+t^2\sum_{(i,j)\in J}\nu_{ij}(t,\{\theta_{lm}\})=a .$$

Due to the standard inverse function theorems this root exists and the function $t(a,x)$ is smooth for sufficiently small $a$ for all $x\in B_J(P^0,P^*)$, and $t(a,x)=a+o(a)$. The solution $P(t)$ (19) of the system $K_x$ (18) leaves $P^0+a\mathbf{Q}_J(P^0,P^*)$ at the point $P(t(a,x))=P^0+ay(x)$, where $y(x)\in B_J(P^0,P^*)$.

To prove that $y(\cdot):B_J(P^0,P^*)\to B_J(P^0,P^*)$ is a homeomorphism of the simplex $B_J(P^0,P^*)$ onto itself, let us notice that the map $x\mapsto y(x)$ leaves the faces of the simplex $B_J(P^0,P^*)$ invariant: vertices transform into themselves, the same for edges, etc.

We use the following topological lemma, the multidimensional intermediate value theorem. Consider a continuous map $\Psi:\Delta_n\to\Delta_n$ of the $n$-dimensional standard simplex into itself. Let each face $F\subset\Delta_n$ be $\Psi$-invariant, i.e. $\Psi(F)\subset F$. Then $\Psi$ is surjective. The proof is possible by induction in $n$: for $n=0$ it is obvious, for $n=1$ this is just the one-dimensional intermediate value theorem. In all dimensions, it can be proved on the basis of the "no-retraction theorem" [29] and simple inductive topological reasoning, which reduces the general case to the situation when all the faces $F\subset\Delta_n$ consist of fixed points of the map $\Psi$.

Therefore, for sufficiently small $a$ the solutions $P(t)$ ($P(0)=P^0$, $t>0$) of the equations from $\mathcal{K}_J$ cover $(P^0+a\mathbf{Q}_J(P^0,P^*))$ in some vicinity of $P^0$. The second inclusion is proven. Let us combine the inclusions and reduce the vicinities, if necessary. □

If $\frac{p_i^0}{p_i^*}\neq\frac{p_j^0}{p_j^*}$ for all $i,j$ ($i\neq j$) then

$$Q(P^0,P^*)=Q(P,P^*)$$

for $P$ in some vicinity of $P^0$. If for some pairs $i,j$ ($i\neq j$) $\frac{p_i^0}{p_i^*}=\frac{p_j^0}{p_j^*}$ (see Fig.) then for some $P\in P^0+Q(P^0,P^*)$ the cone $Q(P,P^*)$ may be bigger than $Q(P^0,P^*)$ even in a small vicinity of $P^0$. Nevertheless, the set of trajectories $P(t)$ ($t>0$, $P(0)=P^0$) remains in $P^0+Q(P^0,P^*)$ for sufficiently small $t$. Let us prove this statement.

Let $\mathcal{K}$ be the class of all master equations with detailed balance with the positive equilibrium $P^*$ (12) with $w^*_{ij}\geq0$ for all $(i,j)$ ($i>j$). We define $\Phi(P^0)$ for an initial distribution $P^0$ as the set of all values $P(t)$ ($t>0$) for solutions $P(t)$ of all equations from the class $\mathcal{K}$ with initial value $P(0)=P^0$.

Proposition 2. For every probability distribution $P^0$ there exists a vicinity $U$ of $P^0$ where $P^0+Q(P^0,P^*)$ coincides with $\Phi(P^0)$:

$$(P^0+Q(P^0,P^*))\cap U=\Phi(P^0)\cap U .$$

Proof. The inclusion $(P^0+Q(P^0,P^*))\cap U\subset\Phi(P^0)\cap U$ is proven in the second part of the proof of Proposition 1 because $\mathcal{K}_J\subset\mathcal{K}$. We have to prove the inclusion $(P^0+Q(P^0,P^*))\cap U\supset\Phi(P^0)\cap U$.

Let us use the combinatorial description of compartments and cones (15). We assume that $P^0\in C_\sigma$ for a surjection $\sigma:\{1,2,\ldots,n\}\to\{1,2,\ldots,k+1\}$. Let us recall that $k=\dim C_\sigma$. If $k=n-1$ then $C_\sigma$ is an open subset of the distribution space and the preimage of every $l=1,2,\ldots,n$ consists of one point. For every $P\in C_\sigma$ the cone $Q(P,P^*)$ coincides with $Q(P^0,P^*)$ and due to Proposition 1 there exists a vicinity $U$ of $P^0$ where $P^0+Q(P^0,P^*)$ coincides with $\Phi(P^0)$.

Let $k<n-1$. Then for some $l=1,\ldots,k+1$ the preimage of $l$ includes more than one point, $|\sigma^{-1}(l)|>1$. Let $I$ be the set of such $l$ and let $S_l=\sigma^{-1}(l)$ be the preimage of $l$. Due to (15),

$$Q(P^0,P^*)=\mathrm{cone}\{\gamma^{ij}\mid\sigma(j)=\sigma(i)+1\}.$$

For a sufficiently small ball $U_r$ with the centre $P^0$ and $P\in(P^0+Q(P^0,P^*))\cap U_r$ the cone $Q(P,P^*)$ may include also some $\gamma^{ij}$ with $\sigma(i)=\sigma(j)$ but

$$Q(P,P^*)\subset\mathrm{cone}\{\gamma^{ij}\mid\sigma(j)=\sigma(i)+1\ \text{or}\ \sigma(j)=\sigma(i)\}. \qquad (20)$$

Let us prove that for any Markov chain with equilibrium $P^*$, for sufficiently small time $\tau>0$ and a ball $U_{r/2}$ with the centre $P^0$, the solutions of the Kolmogorov equations $P(t)$ do not leave $P^0+Q(P^0,P^*)$ during the time interval $[0,\tau]$ if $P(0)\in(P^0+Q(P^0,P^*))\cap U_{r/2}$.

A set $V$ is positively invariant with respect to a dynamical system if every motion that starts in $V$ at $t=0$ remains there for $t>0$. Let a convex set $V$ be positively invariant with respect to several dynamical systems given by Lipschitz vector fields $w_1,\ldots,w_r$. Then $V$ is positively invariant with respect to any combination $w=\sum_if_iw_i$, where $f_i$ are non-negative functions and $w$ is a Lipschitz vector field. Therefore, the problem of positive invariance of a convex set with respect to such combinations of vector fields can be "split" into problems of the positive invariance of $V$ with respect to the summands $w_i$. Due to the second decomposition theorem, we can always assume that the vector field of the Kolmogorov equation for the Markov kinetics is a linear combination of the vector fields of the pairs of elementary transitions $A_i\rightleftharpoons A_j$ with the same equilibrium. The coefficients in these combinations are non-negative functions.

The motion $P(t)$ with $P(0)\in(P^0+Q(P^0,P^*))$ does not leave $(P^0+Q(P^0,P^*))$ in time $t\in[0,\tau]$ if $dP(t)/dt\in Q(P^0,P^*)$ on $[0,\tau]$.

The cone $Q(P^0,P^*)$ is generated by vectors $\gamma^{ij}$ with $\sigma(j)=\sigma(i)+1$. To generate a cone $Q(P,P^*)$ for a point $P\in U_r$ we have to add to the set of $\gamma^{ij}$ ($\sigma(j)=\sigma(i)+1$) some of $\gamma^{ij}$ with $\sigma(j)=\sigma(i)$. Let us consider the pyramid (compare to (16))

$$\mathbf{Q}(P^0)=\Big\{\sum_{\sigma(j)=\sigma(i)+1}\theta_{ij}\gamma^{ij}\ \Big|\ \theta_{ij}\geq0,\ \sum_{\sigma(j)=\sigma(i)+1}\theta_{ij}\leq1\Big\}.$$

We will prove that the set $P^0+a\mathbf{Q}(P^0)$ is positively invariant with respect to any first order kinetics with transitions $A_i\rightleftharpoons A_j$ ($i,j\in S_l$) and equilibrium $P^*$ for any $l=1,\ldots,k+1$.

It is sufficient to consider dynamics in projections on the coordinate subspace $\mathbb{R}^{S_l}$ with coordinates $p_i$, $i\in S_l$, for every $l\in I$ separately. In this space, vectors $\gamma^{ij}$ ($i,j\in S_l$) correspond to the standard first order kinetics like (12) with the reduced vector $P\in\mathbb{R}^{S_l}$ but without compulsory unit balance ($\sum_{i\in S_l}p_i=\mathrm{const}$ with any $\mathrm{const}>0$). A projection of $P^0$ on $\mathbb{R}^{S_l}$, $P^0_{S_l}$, is an equilibrium for this first order kinetics with the balance $\sum_{i\in S_l}p_i=\sum_{i\in S_l}p_i^0$ because $\frac{p_i^0}{p_i^*}=\frac{p_j^0}{p_j^*}$ for $i,j\in S_l$.

The vectors $\gamma^{ij}$ that generate $Q(P^0,P^*)$ ($\sigma(j)=\sigma(i)+1$) (20) have non-zero projections on $\mathbb{R}^{S_l}$ if and only if either $l=\sigma(j)=\sigma(i)+1$ or $\sigma(j)=\sigma(i)+1=l+1$. In the first case, $l=\sigma(j)=\sigma(i)+1$, the projection of $\gamma^{ij}$ is the standard basis vector $e_j$ in $\mathbb{R}^{S_l}$. In the second case, $\sigma(j)=\sigma(i)+1=l+1$, the projection of $\gamma^{ij}$ is $-e_i$. If $l=1$ then only the second case is possible, and if $l=k+1$ then only the first case can take place.



Let $V_l=\{P\in\mathbb{R}^{S_l}\mid p_i\geq0,\ \sum_{i\in S_l}p_i\leq1\}$. The projection of the pyramid $\mathbf{Q}(P^0)$ onto $\mathbb{R}^{S_l}$ is $\mathrm{conv}(V_l\cup(-V_l))$ if $1<l<k+1$; it is $V_l$ if $l=k+1$ and $-V_l$ if $l=1$. (For sets $X,Y$, the sum $X+Y$ is the set of all sums $x+y$ ($x\in X$, $y\in Y$), the difference $X-Y$ is the set of all differences $x-y$, therefore $V-V$ is not $\{0\}$ if $V$ includes more than one element.)

The set $V_l$ is positively invariant with respect to the first order kinetics in $\mathbb{R}^{S_l}$. Therefore, the following sets are also positively invariant with respect to the first order kinetics in $\mathbb{R}^{S_l}$ with equilibrium $P^0_{S_l}$ for every $a>0$:

$$P^0_{S_l}+aV_l,\quad P^0_{S_l}-aV_l,\quad\mathrm{conv}\big((P^0_{S_l}+aV_l)\cup(P^0_{S_l}-aV_l)\big).$$

Thus, the set $P^0+a\mathbf{Q}(P^0)$ is positively invariant with respect to any first order kinetics with transitions $A_i\rightleftharpoons A_j$ ($i,j\in S_l$) and equilibrium $P^*$ for any $l=1,\ldots,k+1$. A combination of these statements for all $l=1,\ldots,k+1$ finalizes the proof. □

This proposition finalizes the justification of the use of the cone of the tangent directions $Q(P^0,P^*)$ in the definition of the local minimum of the Markov order (4).

3. Equivalence of the maxima of all entropies and the Markov order approaches 

The cone $Q(P^0,P^*)$ is a piecewise constant function of $P^0$: it is the same for all $P^0$ from one compartment $C_\sigma$ and, hence, depends on $\sigma$ only. Therefore, if the condition of the local minimum (4) holds for one $P^0\in L\cap C_\sigma$ then it holds also for all elements of $L\cap C_\sigma$. There is a finite number of compartments $C_\sigma$.

Let the linear manifold of conditions $L$ be given by the values of moments $\sum_im_{ri}p_i=M_r$, $L^0=\ker m$ and $L\cap\Delta^{n-1}\neq\emptyset$. The set of all conditional local minima of the Markov order on the linear manifold of conditions $L$ is

$$\bigcup_\sigma\{L\cap C_\sigma\mid L\cap C_\sigma\neq\emptyset\ \text{and}\ L^0\cap Q_\sigma=\{0\}\}, \qquad (21)$$

where $C_\sigma$ and $Q_\sigma$ are defined by (15). It is sufficient to find all $\sigma$ such that $L\cap C_\sigma\neq\emptyset$ and $L^0\cap Q_\sigma=\{0\}$ and then describe the union of the compartments $C_\sigma$ for these $\sigma$.

The approach based on the minimization of all $f$-divergences seems to be very different. For all monotonically increasing functions $g$ we have to solve the equations for the Lagrange multipliers and represent the probability distribution in the form (3). Nevertheless, these approaches are equivalent and describe the same set of the "conditionally maximally disordered distributions".

Theorem 1. A positive distribution $P^0\in L$ satisfies the local conditional minimum conditions of the Markov order if and only if there exists a strictly monotonic function $g$ on $\mathbb{R}$ with $\mathrm{im}\,g=(0,\infty)$ such that the conditions (3) hold for some Lagrange multipliers and for $p_i=p_i^0$.

This means that every conditionally minimal distribution of the Markov order on the linear manifold $L\cap\Delta^{n-1}$ is a conditional minimum on $L\cap\Delta^{n-1}$ of a strictly convex $f$-divergence (1).
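For a single non-trivial moment the criterion of Theorem 1 reduces to a comparison of two orderings. The sketch below assumes that the conditions (3) have the form $p_i=p_i^*g(\sum_r\lambda_rm_{ri})$ and that $L$ is given by the normalization and one moment $\sum_im_ip_i=M$, so that a strictly increasing $\eta=g^{-1}$ with $\eta(p_i^0/p_i^*)=\lambda_0+\lambda_1m_i$ must exist. The function name and the numbers are hypothetical.

```python
def is_single_moment_minimizer(p0, p_star, m):
    """True if eta(p0_i / p*_i) = lam0 + lam1 * m_i is solvable with a
    strictly increasing eta, for some real lam0 and lam1 (one moment m)."""
    ratios = [a / b for a, b in zip(p0, p_star)]
    if max(ratios) == min(ratios):
        return True  # equilibrium distribution: take lam1 = 0
    order = sorted(range(len(p0)), key=lambda i: ratios[i])

    def consistent(sign):
        # sign = +1 corresponds to lam1 > 0, sign = -1 to lam1 < 0
        for a, b in zip(order, order[1:]):
            dr = ratios[b] - ratios[a]
            dm = sign * (m[b] - m[a])
            if dr > 0 and dm <= 0:
                return False  # moment values do not follow the ratios
            if dr == 0 and dm != 0:
                return False  # equal ratios require equal moment values
        return True

    return consistent(+1) or consistent(-1)


# Hypothetical three-state example with uniform equilibrium.
p_star = [1 / 3, 1 / 3, 1 / 3]
m = [0.0, 1.0, 2.0]
monotone_case = is_single_moment_minimizer([0.5, 0.3, 0.2], p_star, m)
mixed_case = is_single_moment_minimizer([0.3, 0.5, 0.2], p_star, m)
```

The comparison uses exact floating-point equality for ties in the ratios, which is adequate for an illustration but would need a tolerance for measured data.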

Proof. Due to the classical theorems about separation of convex sets and linear spaces by linear functionals [30], a distribution $P^0$ satisfies the condition of the local minimum (4) if and only if there exists a linear functional $\psi(P)=\sum_i\psi_ip_i$ such that $\psi|_L=\psi(P^0)=\mathrm{const}$ and $\psi(P)<\psi(P^0)$ for every $P\in P^0+Q(P^0,P^*)$ if $P\neq P^0$. In other words, $\psi|_{L^0}=0$ and

$$\psi_i-\psi_j>0\ \text{ if }\ \frac{p_i^0}{p_i^*}>\frac{p_j^0}{p_j^*} \qquad (22)$$

according to the definition of $Q(P,P^*)$ (13). Condition $\psi|_{L^0}=0$ is equivalent to the existence of the coefficients $\lambda_r$ such that for all $i$

$$\psi_i=\sum_r\lambda_rm_{ri} .$$

Condition (22) is equivalent to the existence of a strictly monotonic (increasing) function $\eta(x)$ defined for $x>0$ such that

$$\psi_i=\eta\left(\frac{p_i^0}{p_i^*}\right).$$

To find such a function $\eta(x)$ we can take the known values $\psi_i$ for $x=p_i^0/p_i^*$ and then use, for example, linear interpolation of $\eta(x)$ between the points $p_i^0/p_i^*$. To extrapolate $\eta(x)$ from $\max_i\{p_i^0/p_i^*\}$ to $+\infty$ we can use an increasing linear function. To extrapolate $\eta(x)$ on the interval $(0,\min_i\{p_i^0/p_i^*\})$ we can use $\varepsilon\log x+\mathrm{const}$.
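The interpolation recipe can be written down directly. The following sketch builds $\eta$ from given pairs (ratio, $\psi$) with piecewise linear interpolation inside the data range, a linear extension of slope one to the right and $\varepsilon\log x+\mathrm{const}$ to the left. The name `make_eta`, the slope of the right extension and the value of `eps` are illustrative choices, not prescribed by the text.

```python
import math


def make_eta(ratios, psi, eps=1e-3):
    """Return a strictly increasing eta on (0, inf) with eta(ratios[i]) = psi[i].

    Raises ValueError if the data are not strictly increasing in the ratio.
    """
    pts = sorted(set(zip(ratios, psi)))
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    if len(set(xs)) != len(xs) or any(b <= a for a, b in zip(ys, ys[1:])):
        raise ValueError("psi must be a strictly increasing function of the ratio")

    def eta(x):
        if x <= 0:
            raise ValueError("eta is defined for x > 0 only")
        if x < xs[0]:
            # left extrapolation: eps * log(x) + const, continuous at xs[0]
            return ys[0] + eps * (math.log(x) - math.log(xs[0]))
        if x > xs[-1]:
            # right extrapolation: increasing linear function
            return ys[-1] + (x - xs[-1])
        for k in range(len(xs) - 1):
            if xs[k] <= x <= xs[k + 1]:
                t = (x - xs[k]) / (xs[k + 1] - xs[k])
                return ys[k] + t * (ys[k + 1] - ys[k])
        return ys[-1]  # single data point, x == xs[0]

    return eta


# Hypothetical values of psi at three ratios p0_i / p*_i.
eta = make_eta([1.5, 0.9, 0.6], [2.0, 0.5, -1.0])
```

The resulting function takes all real values, so its inverse $g$ is defined on the whole real line and has the image $(0,\infty)$, as required in Theorem 1.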

Finally, we can take $h'(x)=\eta(x)$, $h(x)=\int\eta(\xi)\,d\xi$; and $g(y)$ is the inverse function: $g(\eta(x))=x$ for $x>0$. The distribution $P^0$ is the local minimum of $H_h(P\|P^*)$ on $L$.

Conversely, if $P^0$ is a minimum of a strictly convex Lyapunov function $H$ on $L$ and $dH/dt|_{P^0}<0$ for every Markov chain with equilibrium $P^*$ for which $P^0$ is a non-equilibrium distribution then we can take

$$\psi_i=\left.\frac{\partial H}{\partial p_i}\right|_{P^0} .$$

This choice of $\psi_i$ provides (22) (because $H$ is a Lyapunov function that strictly decreases in time) and $\psi|_{L^0}=0$ because $\mathrm{grad}\,H$ is orthogonal to $L$ (the condition of local minimum). □

This equivalence of two definitions of the maximally uncertain distribution under given conditions has 
several important consequences. 

Let us introduce the notion of the (global) Markov order [23].

• If for distributions $P^0$ and $P^1$ there exists such a Markov process with equilibrium $P^*$ that for the solution of the Kolmogorov equation with $P(0)=P^0$ we have $P(1)=P^1$ then we say that $P^0$ and $P^1$ are connected by the Markov preorder [23] with equilibrium $P^*$ and use notation $P^0\succ^0_{P^*}P^1$.

• The (global) Markov order is the closed transitive closure of the Markov preorder. For the Markov order with equilibrium $P^*$ we use notation $P^0\succ_{P^*}P^1$.

The local Markov order at point $P^0$ is just a vector order generated by the tangent cone $Q(P^0,P^*)$ [23]. We use for this local order the notation $>_{P^0,P^*}$:

$$P>_{P^0,P^*}P'\ \text{ if }\ P'-P\in Q(P^0,P^*).$$

The proofs of Propositions 1 and 2 give us the possibility to use the relation $P^0>_{P^0,P^*}P^1$ instead of the Markov preorder for the definition of the Markov order minimizers on linear manifolds. The relation $P^0>_{P^0,P^*}P^1$ is defined by the local Markov order in a vicinity of $P^0$:

$$P^1-P^0\in Q(P^0,P^*).$$

The cone $Q(P^0,P^*)$ depends on $P^0$, therefore, the relation $P^0>_{P^0,P^*}P^1$ is antisymmetric locally, in a vicinity of $P^0$.

Remark 1. It is possible to generate the Markov order by the relation $P^0>_{P^0,P^*}P^1$. Let us specify the vicinity of $P^0$ where this relation is defined and introduce a new relation: $P^0>_{P^*}P^1$ if $P^0>_{P^0,P^*}P^1$ and for all $i,j=1,\ldots,n$

$$\left(\frac{p_i^0}{p_i^*}-\frac{p_j^0}{p_j^*}\right)\left(\frac{p_i^1}{p_i^*}-\frac{p_j^1}{p_j^*}\right)\geq0 .$$

This condition means that the pairs of numbers $\left(\frac{p_i^0}{p_i^*},\frac{p_j^0}{p_j^*}\right)$ and $\left(\frac{p_i^1}{p_i^*},\frac{p_j^1}{p_j^*}\right)$ cannot have an opposite order on the real line. The closed transitive closure of the relation $P^0>_{P^*}P^1$ is the Markov order $P^0\succ_{P^*}P^1$.
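The ordering part of this relation is easy to test for given distributions. The sketch below checks only the product condition on the ratios (the statement that no pair of states changes its order); it does not check the membership $P^1-P^0\in Q(P^0,P^*)$. The function name and the numbers are hypothetical.

```python
def same_ratio_order(p0, p1, p_star):
    """True if (r0_i - r0_j) * (r1_i - r1_j) >= 0 for all pairs i, j,
    where r0 = p0 / p_star and r1 = p1 / p_star (coordinate-wise)."""
    r0 = [a / s for a, s in zip(p0, p_star)]
    r1 = [a / s for a, s in zip(p1, p_star)]
    n = len(p_star)
    return all((r0[i] - r0[j]) * (r1[i] - r1[j]) >= 0
               for i in range(n) for j in range(i + 1, n))


# Hypothetical three-state example with uniform equilibrium.
p_star = [1 / 3, 1 / 3, 1 / 3]
order_kept = same_ratio_order([0.5, 0.3, 0.2], [0.4, 0.35, 0.25], p_star)
order_broken = same_ratio_order([0.5, 0.3, 0.2], [0.3, 0.4, 0.3], p_star)
```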

Let $L$ be a linear manifold in the space of distributions. By definition, $P^0\in L$ is a minimal point on $L\cap\Delta^{n-1}$ with respect to the order $\succ_{P^*}$ if and only if there is no point $P^1\in L\cap\Delta^{n-1}$, $P^1\neq P^0$, such that $P^0\succ_{P^*}P^1$.

Corollary 1. $P^0\in L\cap\Delta^{n-1}$ is a minimal point on $L\cap\Delta^{n-1}$ with respect to the (global) Markov order if and only if it satisfies the local minimum condition (4).

Proof. If $P^0\in L\cap\Delta^{n-1}$ is a minimal point on $L\cap\Delta^{n-1}$ with respect to the (global) Markov order then it satisfies the condition (4) due to the definition of the Markov order through the transitive closure of the relation $P^0>_{P^*}P^1$ and Propositions 1 and 2.

Let $P^0$ satisfy the local minimum condition (4). Then there exists a divergence $H_h(P\|P^*)$ with strictly convex $h(x)$ ($x>0$) such that $P^0$ is a local minimum of $H_h(P\|P^*)$ on $L$. Because of strict convexity, this local minimum is a global one. $H_h(P\|P^*)$ is a Lyapunov function for all Markov chains with equilibrium $P^*$. Therefore, a broken line, which is combined from solutions of the Kolmogorov equations for such Markov chains and starts at $P^0$, leaves a small vicinity of $P^0$ (Propositions 1 and 2) and never returns to a sufficiently small vicinity of $L$. Thus, for the closed transitive closure of the relation $P>_{P^*}P'$, point $P^0$ is a minimal point on $L$. □

Of course, there may be infinitely many minimal points of the Markov order on $L$ and each of them corresponds to a different Lyapunov function $H_h(P\|P^*)$.
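The Lyapunov property used in this proof can be observed numerically. The sketch below assumes the detailed-balance form $dp_i/dt=\sum_jw^*_{ij}(p_j/p_j^*-p_i/p_i^*)$ with symmetric non-negative $w^*_{ij}$ and integrates it by explicit Euler steps. For a sufficiently small step each Euler step is itself a stochastic map with equilibrium $P^*$, so every $H_h$ should not increase along the computed sequence. Only three functions $h$ are tested, which illustrates the property but does not prove it for all $h$. All names and numbers are hypothetical.

```python
import math


def h_divergence(p, p_star, h):
    """H_h(P || P*) = sum_i p*_i * h(p_i / p*_i)."""
    return sum(ps * h(pi / ps) for pi, ps in zip(p, p_star))


def euler_step(p, p_star, w, dt):
    """One explicit Euler step of the assumed detailed-balance master equation."""
    n = len(p)
    r = [p[k] / p_star[k] for k in range(n)]
    return [p[i] + dt * sum(w[i][j] * (r[j] - r[i]) for j in range(n) if j != i)
            for i in range(n)]


CONVEX_H = {
    "x ln x": lambda x: x * math.log(x),
    "-ln x": lambda x: -math.log(x),
    "(x-1)^2/2": lambda x: 0.5 * (x - 1.0) ** 2,
}


def divergences_decrease(p0, p_star, w, dt, steps, tol=1e-12):
    """Return (flag, final distribution); flag is True if no tested H_h grows."""
    p = list(p0)
    prev = {k: h_divergence(p, p_star, h) for k, h in CONVEX_H.items()}
    for _ in range(steps):
        p = euler_step(p, p_star, w, dt)
        cur = {k: h_divergence(p, p_star, h) for k, h in CONVEX_H.items()}
        if any(cur[k] > prev[k] + tol for k in cur):
            return False, p
        prev = cur
    return True, p


# Hypothetical three-state chain with equilibrium p_star.
p_star = [0.5, 0.3, 0.2]
w = [[0.0, 0.2, 0.1],
     [0.2, 0.0, 0.3],
     [0.1, 0.3, 0.0]]
p0 = [0.1, 0.2, 0.7]
ok, p_end = divergences_decrease(p0, p_star, w, dt=0.05, steps=100)
```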

Another remarkable order on the space of distributions is $P^0>_{h,P^*}P^1$ if for all strictly convex functions $h(x)$ ($x>0$)

$$H_h(P^0\|P^*)>H_h(P^1\|P^*),$$

that is, $P^1$ is closer to equilibrium than $P^0$ with respect to all divergences $H_h(P\|P^*)$.

Figure 4: The set $\{P\mid P^0\succ_{P^*}P\}$ for different $P^0$ and for the Markov chains with three states and equilibrium ($p_i^*=1/3$): (a) $\{P\mid P^0\succ_{P^*}P\}$ is convex, (b) it is not convex. The border of the set $\{P\mid P^0\succ_{P^*}P\}$ is highlighted by bold lines. The arrows on these lines correspond to the directions of the extreme rays of the cones $Q(P,P^*)$ (i.e. the angles represented in Fig.).



Corollary 2. For any linear manifold $L$ in the distribution space the minimal elements of the Markov order $\succ_{P^*}$ on $L\cap\Delta^{n-1}$ coincide with the minimal elements of the order $>_{h,P^*}$ on $L\cap\Delta^{n-1}$.

Proof. We just have to combine Theorem 1 with Corollary 1. □

Thus, the minimal elements of the orders $\succ_{P^*}$ and $>_{h,P^*}$ on the linear manifolds coincide. Nevertheless, it is necessary to mention the difference between these orders. Let $P^0$ be a distribution. For $>_{h,P^*}$ the set of distributions $\{P\mid P^0>_{h,P^*}P\}$ is convex as an intersection of convex sets $\{P\mid H_h(P^0\|P^*)>H_h(P\|P^*)\}$ for various strictly convex $h$. This is not the case for the Markov order. The set of distributions $\{P\mid P^0\succ_{P^*}P\}$ may be non-convex. The examples may be extracted from the papers [28, 31] (see Fig. 4).

Corollary 3. Let $P^0\succ_{P^*}P^1$. Then $P^1\in P^0+Q(P^0,P^*)$.

Proof. Let us apply Corollary 1 to all support hyperplanes $L$ of the convex set $(P^0+Q(P^0,P^*))$ for which $(P^0+Q(P^0,P^*))\cap L=\{P^0\}$. □



4. Example: generalization of the normal distribution 

In this section, we discuss distributions $p(x)$ on a continuous space of states, the non-negative real semi-axis, $\mathbb{R}_+=\{x\mid x\geq0\}$. We have in mind two classical examples of distributions of the quantity bounded from below: energy (physics) and wealth (economics and microeconomics).

Let two moments be fixed, the total probability $M_0=\int_0^\infty p(x)\,dx$ and the average quantity $M_1=\int_0^\infty xp(x)\,dx$. The conditional maximization of the classical Boltzmann-Gibbs-Shannon entropy gives:

$$\int_0^\infty p(x)(\ln p(x)-1)\,dx\to\min\ \text{ for given }\ \int_0^\infty p(x)\,dx=M_0,\ \int_0^\infty xp(x)\,dx=M_1 ;$$

$$\ln p(x)=\lambda_0+\lambda_1x,\quad p(x)=\exp(\lambda_0+\lambda_1x),\quad\exp\lambda_0=-\lambda_1M_0=M_1\lambda_1^2 ;$$

and, for the normalized distribution ($M_0=1$),

$$p^*(x)=\frac{1}{M_1}\exp\left(-\frac{x}{M_1}\right). \qquad (23)$$



This Boltzmann distribution always appears as a first candidate for the equilibrium distribution of an additive conserved quantity bounded from below. Khinchin (1943) clearly explained this law as a version of the limit theorem [32].

Technically, it is not difficult to involve the higher moments and obtain the distribution of the form

$$p(x)=\exp(\lambda_0+\lambda_1x+\lambda_2x^2+\ldots+\lambda_rx^r). \qquad (24)$$

One can expect that this extension of the set of moments may improve the description. This is a traditional
belief in Extended Irreversible Thermodynamics (EIT) Q . 

There may be many different approaches to evaluation of the quality of the approximation (24) but at least one important property of these functions is wrong: the asymptotic behavior at large $x$ is $p(x)\propto\exp(-\mathrm{const}\times x^r)$. These "super-light" tails of the distribution $p(x)$ change qualitatively with the change of the order $r$ in (24).

If we use, for example, the "regularizing" fourth moment in the moment chain for the Boltzmann equation [33] then we corrupt the $e^{-\mathrm{const}\times v^2}$ tails of the Maxwell distribution. Therefore, other approaches which do
not modify the tails of the distribution qualitatively (like Q) may be more appreciated. 

The asymptotic behavior of the distribution's tails was thoroughly studied in many cases. Very often, the tails of the distributions are, without a doubt, heavier than normal $e^{-\mathrm{const}\times x^2}$ and definitely are not cut as $e^{-\mathrm{const}\times x^4}$. For example, it is demonstrated that the distribution of money between people has the exponential tail with a possible transformation into a heavier power tail for very rich people [34].

The general solution with the Boltzmann equilibrium (23) gives the following expression instead of (24):

$$p(x)=g(\lambda_0+\lambda_1x+\lambda_2x^2+\ldots+\lambda_rx^r)\,\frac{1}{M_1}\exp\left(-\frac{x}{M_1}\right),$$

where $g$ is a monotonically increasing function. In particular, for the moments $M_0$, $M_1$ and $M_2$ we obtain

$$p(x)=g(\lambda_0+\lambda_1x+\lambda_2x^2)\,\frac{1}{M_1}\exp\left(-\frac{x}{M_1}\right). \qquad (25)$$

There are four qualitatively different cases of (25). Let $\lambda_2\neq0$ and $\mu=-\frac{\lambda_1}{2\lambda_2}$. Then

$$p(x)=f(x)\,\frac{1}{M_1}\exp\left(-\frac{x}{M_1}\right), \qquad (26)$$

where $f(x)=g(\lambda_0+\lambda_1x+\lambda_2x^2)$, and

1. if $\mu\leq0$ and $\lambda_2>0$ then $f(x)$ is a monotonically increasing function on $[0,\infty)$;

2. if $\mu\leq0$ and $\lambda_2<0$ then $f(x)$ is a monotonically decreasing function on $[0,\infty)$;

3. if $\mu>0$ and $\lambda_2>0$ then $f(x)$ is a monotonically increasing function on $[\mu,\infty)$ and $f(x)=f(2\mu-x)$ for $x\in[0,\mu]$;

4. if $\mu>0$ and $\lambda_2<0$ then $f(x)$ is a monotonically decreasing function on $[\mu,\infty)$ and $f(x)=f(2\mu-x)$ for $x\in[0,\mu]$.
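The four cases are determined by the signs of $\lambda_2$ and of the axis of the quadratic polynomial. The following sketch assumes $\mu=-\lambda_1/(2\lambda_2)$ (the symmetry axis implied by $f(x)=f(2\mu-x)$) and assigns $\mu=0$ to the first two cases. The function names are hypothetical.

```python
def classify_case(lam1, lam2):
    """Return the case number (1..4) for the polynomial lam0 + lam1*x + lam2*x**2."""
    if lam2 == 0:
        raise ValueError("lam2 must be non-zero")
    mu = -lam1 / (2.0 * lam2)  # symmetry axis of the quadratic polynomial
    if mu <= 0:
        return 1 if lam2 > 0 else 2
    return 3 if lam2 > 0 else 4


def quadratic(lam0, lam1, lam2, x):
    """Argument of the increasing function g in the two-moment case."""
    return lam0 + lam1 * x + lam2 * x * x


def is_symmetric_about_axis(lam0, lam1, lam2, x):
    """Check q(x) == q(2*mu - x), which gives f(x) = f(2*mu - x) for f = g(q)."""
    mu = -lam1 / (2.0 * lam2)
    left = quadratic(lam0, lam1, lam2, x)
    right = quadratic(lam0, lam1, lam2, 2.0 * mu - x)
    return abs(left - right) < 1e-9


cases = [classify_case(1.0, 1.0), classify_case(-1.0, -1.0),
         classify_case(-2.0, 1.0), classify_case(2.0, -1.0)]
```

Since $g$ is increasing, $f$ inherits the monotonicity of the polynomial on each side of $\mu$, which is the content of the four items.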

Each of these "generalized normal distributions" (26) is a minimizer of the corresponding $f$-divergence. For the construction of such a divergence in the general case, it is convenient to define the convex functions $h$ in (1) with values on an extended real line with additional possible value $+\infty$. This is a natural general definition of convex functions [30]. In case 1 ($\mu\leq0$, $\lambda_2>0$, and $f$ increases), we can take in (25), (26) without loss of generality $\mu=0$, $f(x)=g(x^2)$, and $g(y)=f(\sqrt{y})$. The monotonically increasing function $g(y)$ is, therefore, defined on $[0,\infty)$ with the set of values $[\underline{g},\overline{g})$, where $\underline{g}=f(0)>0$, $\overline{g}=\lim_{x\to\infty}f(x)>0$ and the upper limit may be finite or infinite. The inverse function $\xi(z)$ is defined for $z\in[\underline{g},\overline{g})$ with the interval of values $[0,\infty)$. Let us take

$$h'(z)=\begin{cases}0&\text{if }z<\underline{g};\\ \xi(z)&\text{if }z\in[\underline{g},\overline{g});\\ \infty&\text{if }z\geq\overline{g};\end{cases}\qquad h(z)=\begin{cases}0&\text{if }z<\underline{g};\\ \int_{\underline{g}}^{z}\xi(\varsigma)\,d\varsigma&\text{if }z\in[\underline{g},\overline{g}];\\ \infty&\text{if }z>\overline{g}.\end{cases} \qquad (27)$$

The improper integral $\int_{\underline{g}}^{\overline{g}}\xi(\varsigma)\,d\varsigma$ may take finite or infinite values.

Similarly, in case 2 ($\mu\leq0$ and $\lambda_2<0$, $f$ decreases) we define $g(y)=f(\sqrt{-y})$ for $y\in(-\infty,0]$. The function $g(y)$ monotonically increases and takes values on $(\underline{g},\overline{g}]$, where $\underline{g}=\lim_{x\to\infty}f(x)$ and $\overline{g}=f(0)$. The inverse function $\xi(z)$ is defined for $z\in(\underline{g},\overline{g}]$ with the interval of values $(-\infty,0]$. In this case, we can take

$$h'(z)=\begin{cases}-\infty&\text{if }z\leq\underline{g};\\ \xi(z)&\text{if }z\in(\underline{g},\overline{g}];\\ 0&\text{if }z>\overline{g};\end{cases}\qquad h(z)=\begin{cases}\infty&\text{if }z<\underline{g};\\ -\int_{z}^{\overline{g}}\xi(\varsigma)\,d\varsigma&\text{if }z\in[\underline{g},\overline{g}];\\ 0&\text{if }z>\overline{g}.\end{cases} \qquad (28)$$

In case 3, the construction is almost the same as for case 1 but $f(x)=g((x-\mu)^2)$ and $g(y)=f(\sqrt{y}+\mu)$. In this case, $g(y)$ is a monotonically increasing function defined on the interval $[0,\infty)$ with the set of values $[\underline{g},\overline{g})$, where $\underline{g}=f(\mu)$ and $\overline{g}=\lim_{x\to\infty}f(x)$. Similarly, for case 4, the construction of $h(z)$ is almost the same as in case 2.

Thus, for every distribution in the form (26) we can find an $f$-divergence $H_h(P\|P^*)$, whose conditional minimization produces this distribution. For example, if in (26) $f(x)=ax^\beta$ then we can take $h$ in the form $h(z)=\frac{a\beta}{\beta+2}(z/a)^{1+2/\beta}$.
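The power-law example can be verified under the assumptions of case 1 with $\mu=0$, that is $f(x)=g(x^2)$ and $h'=\xi=g^{-1}$. For $f(x)=ax^\beta$ with $a,\beta>0$ this gives $g(y)=ay^{\beta/2}$, $\xi(z)=(z/a)^{2/\beta}$ and, integrating from zero, $h(z)=\frac{a\beta}{\beta+2}(z/a)^{1+2/\beta}$. The sketch below checks numerically that $h'(f(x))=x^2$; the helper names and the values of $a$ and $\beta$ are hypothetical.

```python
def make_h(a, beta):
    """h(z) = a*beta/(beta+2) * (z/a)**(1 + 2/beta), an antiderivative of (z/a)**(2/beta)."""
    coeff = a * beta / (beta + 2.0)
    power = 1.0 + 2.0 / beta
    return lambda z: coeff * (z / a) ** power


def derivative(func, z, eps=1e-6):
    """Central finite-difference approximation of func'(z)."""
    return (func(z + eps) - func(z - eps)) / (2.0 * eps)


a, beta = 2.0, 3.0
h = make_h(a, beta)
# Pairs (x, h'(f(x))) with f(x) = a * x**beta; the second entry should be close to x**2.
checks = [(x, derivative(h, a * x ** beta)) for x in (0.5, 1.0, 2.0)]
```

The stationarity condition $h'(p/p^*)=\lambda_0+\lambda_1x+\lambda_2x^2$ with the polynomial reduced to $x^2$ then returns $p/p^*=f(x)$.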

5. Conclusion 

The Maxallent approach aims to bring some order to the modern anarchy of the measures of disorder. If 
there is no clear idea which entropy is better then we have to use all of them together. 

The Markov order approach was also proposed as an alternative to the entropic anarchism. It is based 
on the idea that the disorder has to increase in random processes with given equilibrium distribution, which 
is considered as the maximally disordered state. Here, we have proved that these two approaches produce 
the same conditional minimizers on the planes of given values of moments (Theorem 1).

In this paper, we have considered several relations between positive distributions: 

1. $P^0\succ^0_{P^*}P^1$ if there exists a Markov chain with equilibrium $P^*$ such that for the solution of the Kolmogorov equation $P(t)$ with $P(0)=P^0$ we have $P(1)=P^1$;

2. $P^0\succ_{P^*}P^1$ if there exist integrable bounded functions $q_{ij}(t)$ ($i,j=1,\ldots,n$, $i\neq j$, $t\geq0$) such that $q_{ij}(t)$ satisfy the balance condition (6) for given $P^*$ ($p_i^*>0$) (for all $t\geq0$), and $P(1)=P^1$ for the solution $P(t)$ of the equations

$$\frac{dp_i}{dt}=\sum_{j,\,j\neq i}\big(q_{ij}(t)p_j-q_{ji}(t)p_i\big)\quad(i=1,\ldots,n)$$

with $P(0)=P^0$ (that is, $\succ_{P^*}$ is the transitive closure of $\succ^0_{P^*}$);

3. $P^0>_{h,P^*}P^1$ if $H_h(P^0\|P^*)>H_h(P^1\|P^*)$ for all strictly convex functions $h(x)$ on a semi-axis $x>0$;

4. $P^0>_{P^0,P^*}P^1$ if $P^1-P^0\in Q(P^0,P^*)$, where $Q(P^0,P^*)$ is the cone of possible velocities $dP/dt$ (13) at point $P^0$ for all Markov chains with equilibrium $P^*$.

All these relations are different. Three of them are antisymmetric, and one, $P^0>_{P^0,P^*}P^1$, is locally antisymmetric, in a vicinity of $P^0$. Their interrelations are described by the following implications:

$$(P^0\succ^0_{P^*}P^1)\Rightarrow(P^0\succ_{P^*}P^1)\Rightarrow(P^0>_{h,P^*}P^1)\Rightarrow(P^0>_{P^0,P^*}P^1).$$

The local Markov order $P^0>_{P^0,P^*}P^1$ is the weakest and the connection by a solution of the Kolmogorov equation $P^0\succ^0_{P^*}P^1$ is the strongest of these relations. Nevertheless, locally, in a small vicinity of a positive non-equilibrium distribution $P^0$, these relations coincide and they define the same set of locally minimal distributions on a linear manifold of conditions $L$ (Propositions 1, 2, Theorem 1, Corollaries 1, 2 and 3).

Of course, there is the other, the classical way to reduce the variability of the measures of disorder. The 
divergences H(P\\P*) can be defined by their main properties. This is an axiomatic approach: we postulate 
some "natural properties" of the divergence, then find the divergences with these properties, evaluate the 
result and decide whether we have to change the system of axioms or not. The axiomatic approach to the
definition of entropy was used by Shannon [4] and elaborated in detail by Khinchin [35].

Two distinguished additivity properties are important for the Maxent reasoning:

• Additivity on the algebra of states: $H(P\|P^*)$ is a sum in states

$$H(P\|P^*)=\sum_i\eta(p_i,p_i^*).$$

• Additivity with respect to the joining of independent subsystems. This means that if $P$ and $P^*$ are products of distributions then $H(P\|P^*)$ is the sum of the corresponding entropies: if $P=(p_{ij})=(q_ir_j)$ and $P^*=(p^*_{ij})=(q^*_ir^*_j)$ then $H(P\|P^*)=H(Q\|Q^*)+H(R\|R^*)$.
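The second additivity property can be illustrated on product distributions. The sketch below compares the Kullback-Leibler divergence, which is additive, with the quadratic divergence $\frac12\sum_i(p_i-p_i^*)^2/p_i^*$, which has the trace form but is not additive; for the latter the quantity $1+2H$ is multiplicative, so additivity holds only after a change of scale. The function names and the numbers are hypothetical.

```python
import math


def kl(p, p_star):
    """Kullback-Leibler divergence D_KL(P || P*)."""
    return sum(a * math.log(a / b) for a, b in zip(p, p_star) if a > 0)


def quadratic_divergence(p, p_star):
    """H_2(P || P*) = (1/2) * sum_i (p_i - p*_i)**2 / p*_i."""
    return 0.5 * sum((a - b) ** 2 / b for a, b in zip(p, p_star))


def product(q, r):
    """Joint distribution of two independent subsystems."""
    return [qi * rj for qi in q for rj in r]


q, q_star = [0.7, 0.3], [0.5, 0.5]
r, r_star = [0.2, 0.3, 0.5], [0.3, 0.3, 0.4]
p, p_star = product(q, r), product(q_star, r_star)

kl_gap = kl(p, p_star) - (kl(q, q_star) + kl(r, r_star))
h2_joint = quadratic_divergence(p, p_star)
h2_q = quadratic_divergence(q, q_star)
h2_r = quadratic_divergence(r, r_star)
```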

The first additivity property implies that the restriction of the Maxent distribution on a subset of the event space $\Omega$ is also a Maxent distribution if the condition functionals are also additive on the algebra of states. The second additivity property implies that the Maxent distribution is a product of the Maxent distributions of subsystems if the condition functionals are additive with respect to the joining of subsystems and the equilibrium is a product of distributions. For more details we refer, for example, to the review in [23].

If we join the first additivity property with the requirement that the divergence should be a Lyapunov function for all Markov chains with equilibrium $P^*$ then we get $H_h(P\|P^*)$ of the form (1) [21], [22]. If we add the second additivity property and require continuity of $H_h(P\|P^*)$ for all values of $P$ (including vectors with some $p_i=0$) then the classical Boltzmann-Gibbs-Shannon relative entropy will be the only possibility (that is, $H_h(P\|P^*)$ with $h(x)=x\ln x$ up to unimportant constant factors and summand). If we relax the requirement of the continuity to the set of strictly positive distributions then we will get the one-parametric family $H_h(P\|P^*)$ with $h(x)=\beta x\ln x-(1-\beta)\ln x$ [21], [23].

Let us accept the point of view that the divergence is an order. Then the values are not important and all the divergences connected by a monotonic transformation of a scale, $H=f(H')$ (with a monotonically increasing $f$), are equivalent. If the first additivity property is valid in one scale, and the second may be valid in another one, then one more one-parametric family appears, the Cressie-Read divergences (see Appendix A) [21]. The Tsallis entropy is a particular case of them. The Boltzmann-Gibbs-Shannon relative entropy (or the Kullback-Leibler entropy, which is the same), the convex combination of $H_h(P\|P^*)$ and $H_h(P^*\|P)$ for $h(x)=x\ln x$, and the Cressie-Read divergences (including the Tsallis relative entropy) form the "entropic aristocracy" distinguished mostly by the additivity properties.

If we accept both the additivity on the algebra of states (i.e., the trace form) and the additivity with respect to joining of independent subsystems then we have to use some of these functions. If additivity with respect to joining of independent subsystems seems to be too restrictive then we have to take the wider class of divergences, for example, $H_h(P\|P^*)$ of the form (1). If we reject the requirement of the trace form then the variety of the admissible divergences becomes even richer. This uncertainty in the choice of divergence forces us to use the Maxallent approach.

The Maxallent approach produces a set of conditionally maximally disordered distributions instead of a single distribution that maximizes a selected distinguished entropy in the usual Maxent method. These Maxallent sets of distributions may be considered as probabilistic analogues of the type-2 fuzzy sets introduced by L. Zadeh [36] to capture the uncertainty of the fuzzy systems. The Maxallent approach was invented to manage the uncertainty of the measures of uncertainty. If there is no uncertainty of uncertainty then the set of distributions reduces to a single distribution.

The decomposition theorems for Markov chains provide us with tools for the efficient calculation of the Markov order. Following [27], we compare the general Markov chains and the reversible chains with detailed balance. For any general chain there is a reversible chain with the same velocity vector at a given point. The classes of general and reversible chains locally coincide because they have the same cone of possible velocities at every non-equilibrium distribution (the second decomposition theorem, Appendix B). This theorem gives us the possibility to describe the set of the conditionally maximally uncertain distributions combinatorially, in the finite form (21).

For the classical Boltzmann-Gibbs-Shannon entropy the distribution on $\mathbb{R}_+$ with two given moments has the Gaussian form $a\exp(-b(x-c)^2)$. The class of the Maxallent distributions on $\mathbb{R}_+$ with two given moments is also simple (26) but much richer. It can be produced by multiplication of the Boltzmann distribution (23) by a monotonic function or unimodal function (with one local maximum) or by a function with one local minimum.

There exists an attractive possibility: if a distribution can be obtained in the Maxallent approach then it is a conditional minimum of a divergence. If we find or guess a distribution of the Maxallent type for an empirical system then we can restore the divergence and then use it in the standard Maxent reasoning.

The Maxallent approach is, surprisingly, efficient enough to analyze some practical problems. It gives
an answer that does not depend on the subjective choice and, therefore, returns us to the "mission" of 
information theory: "to eliminate the psychological factors involved..." Q. At the same time, it has a solid 
basis in the theory of Lyapunov functions for the Kolmogorov equations. 

Now, essential mathematical work on the basic notion of entropy is needed. Gromov suggests that the natural mathematical language for this work will involve nonstandard analysis and category theory [37]. These abstract languages seem to be closer to the basic intuition than the set theory of Cantor and the $\varepsilon$-$\delta$ reasoning of the classical analysis. Nevertheless, the basic idea of Maxallent is so simple and natural that it should persist in the future advanced theory of entropy: order is something that decreases in Markov processes.



Appendix A. The most popular examples of $H_h(P\|P^*)$

The most popular examples of $H_h(P\|P^*)$ are [23]:

1. Let $h(x)$ be the step function, $h(x)=0$ if $x=0$ and $h(x)=-1$ if $x>0$. In this case,

$$H_h(P\|P^*)=-\sum_{i,\ p_i>0}1 . \qquad (29)$$

The quantity $-H_h$ is the number of non-zero probabilities $p_i$ and does not depend on $P^*$. Sometimes it is called the Hartley entropy.

2. $h=|x-1|$,

$$H_h(P\|P^*)=\sum_i|p_i-p_i^*| ;$$

this is the $l_1$-distance between $P$ and $P^*$.

3. $h=x\ln x$,

$$H_h(P\|P^*)=\sum_ip_i\ln\left(\frac{p_i}{p_i^*}\right)=D_{\mathrm{KL}}(P\|P^*) ; \qquad (30)$$

this is the usual Kullback-Leibler divergence or the relative Boltzmann-Gibbs-Shannon (BGS) entropy;

4. $h=-\ln x$,

$$H_h(P\|P^*)=-\sum_ip_i^*\ln\left(\frac{p_i}{p_i^*}\right)=D_{\mathrm{KL}}(P^*\|P) ; \qquad (31)$$

this is the relative Burg entropy. It is obvious that this is again the Kullback-Leibler divergence, but for another order of arguments.

5. Convex combinations of $h=x\ln x$ and $h=-\ln x$ also produce a remarkable family of divergences: $h=\beta x\ln x-(1-\beta)\ln x$ ($\beta\in[0,1]$),

$$H_h(P\|P^*)=\beta D_{\mathrm{KL}}(P\|P^*)+(1-\beta)D_{\mathrm{KL}}(P^*\|P) ; \qquad (32)$$

this convex combination of divergences was used by Gorban in the early 1980s [38] and studied further by Gorban and Karlin [39]. It becomes a symmetric functional of $(P,P^*)$ for $\beta=1/2$. There exists a special name for this case, "Jeffreys' entropy".

6 . h=^, 

H h (P\\P*) = £ £ iPl ~f )2 = H 2 (P\\P*) ; (33) 

this is the quadratic term in the Taylor expansion of the relative Boltzmann-Gibbs-Shannon entropy, 
Dkl(P\\P*), near equilibrium. We have used its time derivative in ([8]). 
7. $h = \frac{x^{\lambda+1} - x}{\lambda(\lambda+1)}$,

$$H_h(P\|P^*) = \frac{1}{\lambda(\lambda+1)} \sum_i p_i \left[\left(\frac{p_i}{p^*_i}\right)^{\lambda} - 1\right]; \tag{34}$$

this is the Cressie-Read (CR) family of power divergences [40] (a modern exposition of the history,
properties and applications of these entropies is presented in [41]). For this family we use the notation
$H_{\mathrm{CR}\,\lambda}$. If $\lambda \to 0$ then $H_{\mathrm{CR}\,\lambda} \to D_{\mathrm{KL}}(P\|P^*)$, this is the classical BGS relative entropy; if $\lambda \to -1$
then $H_{\mathrm{CR}\,\lambda} \to D_{\mathrm{KL}}(P^*\|P)$, this is the relative Burg entropy.

8. For the CR family in the limits $\lambda \to \pm\infty$ only the maximal terms "survive". Exactly as we get the
limit $l^\infty$ of $l^p$ norms for $p \to \infty$, we can use $(\lambda(\lambda+1) H_{\mathrm{CR}\,\lambda})^{1/|\lambda|}$ for $\lambda \to \pm\infty$ and write in these
limits:

$$H_{\mathrm{CR}\,\infty}(P\|P^*) = \max_i \left\{\frac{p_i}{p^*_i}\right\} - 1, \tag{35}$$

$$H_{\mathrm{CR}\,-\infty}(P\|P^*) = \max_i \left\{\frac{p^*_i}{p_i}\right\} - 1. \tag{36}$$

The existence of two limiting divergences $H_{\mathrm{CR}\,\pm\infty}$ seems very natural: there may be two types of
extremely non-equilibrium states: with a high excess of current probability $p_i$ above $p^*_i$ and, inversely,
with an extremely small current probability $p_i$ with respect to $p^*_i$.

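9. The Tsallis relative entropy [42]: $h = \frac{x^{\alpha} - x}{\alpha - 1}$, $\alpha > 0$,

$$H_h(P\|P^*) = \frac{1}{\alpha - 1} \sum_i p_i \left[\left(\frac{p_i}{p^*_i}\right)^{\alpha - 1} - 1\right]. \tag{37}$$

For this family we use the notation $H_{\mathrm{Ts}\,\alpha}$.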
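The examples of this appendix can be checked numerically. The following is a minimal sketch, assuming the general form $H_h(P\|P^*) = \sum_i p^*_i h(p_i/p^*_i)$ with strictly positive $p_i$ and $p^*_i$; the function names (`h_divergence`, `cressie_read`, `tsallis`) and the test distributions are illustrative choices, not part of the original text.

```python
import numpy as np

def h_divergence(h, p, p_star):
    """H_h(P||P*) = sum_i p*_i h(p_i / p*_i); assumes all p_i, p*_i > 0."""
    p = np.asarray(p, dtype=float)
    p_star = np.asarray(p_star, dtype=float)
    return float(np.sum(p_star * h(p / p_star)))

def kl(x):          # h = x ln x   gives D_KL(P||P*)
    return x * np.log(x)

def burg(x):        # h = -ln x    gives D_KL(P*||P)
    return -np.log(x)

def l1(x):          # h = |x - 1|  gives the l_1-distance
    return np.abs(x - 1.0)

def quadratic(x):   # h = (x - 1)^2 / 2 gives H_2
    return 0.5 * (x - 1.0) ** 2

def cressie_read(lam):   # lam must differ from 0 and -1
    return lambda x: (x ** (lam + 1.0) - x) / (lam * (lam + 1.0))

def tsallis(alpha):      # alpha > 0, alpha != 1
    return lambda x: (x ** alpha - x) / (alpha - 1.0)

# Illustrative distributions (hypothetical data).
p = np.array([0.5, 0.3, 0.2])
p_star = np.array([0.2, 0.3, 0.5])

d_kl = h_divergence(kl, p, p_star)
d_burg = h_divergence(burg, p, p_star)

# Limits of the CR family: lam -> 0 and lam -> -1.
cr_near_0 = h_divergence(cressie_read(1e-6), p, p_star)
cr_near_m1 = h_divergence(cressie_read(-1.0 + 1e-6), p, p_star)

# Large-lam behaviour: the root approaches max_i p_i / p*_i.
lam = 200.0
cr_root = (lam * (lam + 1.0) * h_divergence(cressie_read(lam), p, p_star)) ** (1.0 / lam)

print(d_kl, cr_near_0)
print(d_burg, cr_near_m1)
print(cr_root, np.max(p / p_star))
```

In this sketch the CR divergence with $\lambda = 1$ coincides with $H_2$, and the Tsallis divergence with parameter $\alpha$ equals $\alpha H_{\mathrm{CR}\,\alpha-1}$, which follows directly from comparing the two functions $h$.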



Appendix B. The decomposition theorems 

The first decomposition theorem. Every Markov chain with a positive equilibrium is a conic combination 
of simple cycles with the same equilibrium. 

Proof. If a non-zero Markov chain has a positive equilibrium then it cannot be acyclic: there exists at least
one oriented cycle of transitions with nonzero rate constants. The length of this cycle can vary from 2 to
$n$. The set of all Markov chains with a positive equilibrium $P^*$ is the intersection of a linear subspace given
by the balance equations (5) with the positive orthant $\mathbb{R}_+^{n(n-1)}$. This is a polyhedral cone which does not
include a whole straight line. It is well known in convex geometry that every such polyhedral cone is a
convex hull of a finite number of its extreme rays [30]. A ray $l$ with direction vector $x \neq 0$ is a set $l = \{\kappa x\}$
($\kappa \geq 0$). By definition, it is an extreme ray of a cone $Q$ if for any $u \in l$ and any $x, y \in Q$, whenever
$u = (x + y)/2$, we must have $x, y \in l$.

Any extreme ray of the cone of Markov chains with equilibrium $P^*$ is a simple cycle
$A_{i_1} \to A_{i_2} \to \ldots \to A_{i_k} \to A_{i_1}$ with rate constants $q_{i_{j+1} i_j} = \kappa / p^*_{i_j}$. Indeed, let a non-zero Markov chain $Q$ with coefficients
$q_{ij}$ belong to an extreme ray of this cone. This chain includes a simple cycle with non-zero coefficients,
$A_{i_1} \to A_{i_2} \to \ldots \to A_{i_k} \to A_{i_1}$ ($k \leq n$, all the numbers $i_1, \ldots, i_k$ are different, $q_{i_{j+1} i_j} > 0$ for $j = 1, \ldots, k$, and
$i_{k+1} = i_1$). For sufficiently small $\kappa$ ($0 < \kappa < \kappa_0$), $q_{i_{j+1} i_j} - \kappa / p^*_{i_j} > 0$ ($j = 1, \ldots, k$). Let $Q_\kappa$ be the same simple
cycle with the rate constants $q_{i_{j+1} i_j} = \kappa / p^*_{i_j}$. Then for $0 < \kappa < \kappa_0$ the vectors $Q \pm Q_\kappa$ also represent Markov
chains with the equilibrium $P^*$. Obviously, $Q = \frac{(Q + Q_\kappa) + (Q - Q_\kappa)}{2}$; hence, $Q$ should be proportional to $Q_\kappa$,
by the definition of extreme rays.

So, any Markov chain with a positive equilibrium P* is a linear combination with positive coefficients 
of the cycles with the same equilibrium. This decomposition is global, it does not depend on the current 
distribution P. □ 
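The statement of the first decomposition theorem can be illustrated computationally. The sketch below does not follow the extreme-ray argument of the proof; it swaps in a greedy cycle-peeling procedure applied to the equilibrium fluxes $w_{ij} = q_{ij} p^*_j$, which are balanced at every vertex. The convention that `q[i, j]` is the rate constant of $A_j \to A_i$, the helper names and the example chain are assumptions made for this illustration.

```python
import numpy as np

def cycle_chain(cycle, kappa, p_star):
    """Rate matrix of a simple cycle with constants q[next, cur] = kappa / p_star[cur]."""
    q = np.zeros((len(p_star), len(p_star)))
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        q[b, a] = kappa / p_star[a]      # q[i, j] is the rate constant of A_j -> A_i
    return q

def velocity(q, p):
    """Right hand side of the Kolmogorov equation: inflow minus outflow."""
    return q @ p - q.sum(axis=0) * p

def decompose_into_cycles(q, p_star, tol=1e-12):
    """Greedy peeling of simple cycles from the equilibrium fluxes w[i, j] = q[i, j] * p*_j."""
    w = q * np.asarray(p_star, dtype=float)[np.newaxis, :]
    np.fill_diagonal(w, 0.0)
    w[w < tol] = 0.0
    cycles = []
    while w.max() > 0.0:
        # Start at the source of the largest remaining flux and follow outgoing fluxes.
        path = [int(np.unravel_index(np.argmax(w), w.shape)[1])]
        while True:
            nxt = int(np.argmax(w[:, path[-1]]))
            if w[nxt, path[-1]] <= 0.0:
                raise ValueError("fluxes are not balanced: P* is not an equilibrium")
            if nxt in path:
                cycle = path[path.index(nxt):]
                break
            path.append(nxt)
        edges = list(zip(cycle, cycle[1:] + cycle[:1]))
        kappa = min(w[b, a] for a, b in edges)
        for a, b in edges:
            w[b, a] -= kappa             # at least one edge of the cycle becomes zero
        w[w < tol] = 0.0
        cycles.append((cycle, kappa))
    return cycles

# Hypothetical example: a chain assembled from three simple cycles.
p_star = np.array([0.2, 0.3, 0.5])
q = (cycle_chain([0, 1, 2], 0.6, p_star)
     + cycle_chain([0, 2], 0.25, p_star)
     + cycle_chain([1, 2], 0.1, p_star))

cycles = decompose_into_cycles(q, p_star)
rebuilt = sum(cycle_chain(c, k, p_star) for c, k in cycles)
print(cycles)
print(np.allclose(rebuilt, q), np.allclose(velocity(q, p_star), 0.0))
```

Each pass removes at least one transition, so the loop stops after at most $n(n-1)$ cycles. The decomposition found this way need not be unique, but the recovered cycles always sum to the original chain and share its equilibrium.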
The second decomposition theorem. For every Markov chain with a positive equilibrium $P^*$ and any
probability distribution $P^0$ the vector $dP/dt|_{P^0}$ is a conic combination of the vectors $dP/dt|_{P^0}$ for the simple
cycles of length two $A_i \rightleftharpoons A_j$ with the same equilibrium.

Proof. Let us start from a simple cycle $A_1 \to A_2 \to \ldots \to A_n \to A_1$ with the constants $q_{i+1\,i} = 1/p^*_i$, where
$p^*_i > 0$ is the equilibrium. At a non-equilibrium distribution $P$ the right hand side of equation (5) is the
vector $dP/dt = v_n$ with coordinates

$$(v_n)_j = \frac{p_{j-1}}{p^*_{j-1}} - \frac{p_j}{p^*_j} \tag{38}$$

(with the cyclic convention $p_0 = p_n$, $p^*_0 = p^*_n$).

The flux $A_j \to A_{j+1}$ is $p_j/p^*_j$. Let us find $A_j$ with the minimum value of this flux and, for convenience,
let us put this $A_j$ in the first position by a cyclic permutation. We will represent the right hand side vector
$v_n$ in the form

$$v_n = v_{n-1} + \kappa v_2,$$

where $v_{n-1}$ corresponds to the cycle of length $n - 1$, $A_2 \to \ldots \to A_n \to A_2$, with the rate constants
$q_{i+1\,i} = 1/p^*_i$ (and the cyclic convention $n + 1 = 2$), $v_2$ corresponds to the cycle of length 2, $A_1 \rightleftharpoons A_2$,
with the rate constants $q_{21} = 1/p^*_1$, $q_{12} = 1/p^*_2$, and $\kappa \geq 0$. Both velocities $v_{n-1}$ and $v_2$ should be calculated
for the same distribution $P$.

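We find the constant $\kappa$ from the condition $v_n = v_{n-1} + \kappa v_2$ at the point $P$; hence, the two following
reaction schemes, (a) and (b), should have the same velocities, $dP/dt$:

(a) $A_n \xrightarrow{1/p^*_n} A_1 \xrightarrow{1/p^*_1} A_2$ and (b) $A_n \xrightarrow{1/p^*_n} A_2$; $A_1 \underset{\kappa/p^*_2}{\overset{\kappa/p^*_1}{\rightleftharpoons}} A_2$.

From this condition,

$$\kappa = \frac{p_n/p^*_n - p_1/p^*_1}{p_2/p^*_2 - p_1/p^*_1}.$$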

The inequality $\kappa \geq 0$ holds because $p_1/p^*_1$ is the minimal value of the flux $p_j/p^*_j$.

We just delete the vertex with the smallest outgoing flux from the initial cycle of length $n$ and add a
cycle of length 2 with the same equilibrium. Let us repeat this operation for the remaining cycle of
length $n - 1$, and so on. At the end, the vector $v_n$ on the left hand side will be represented as a combination with
positive coefficients of the vectors $dP/dt$ for the cycles of length 2, $A_i \rightleftharpoons A_j$, with the same equilibrium.
This is a system with detailed balance. We have to stress here that the set of these transitions and the
coefficients $\kappa$ depend on the current distribution $P$.

For every distribution $P$, the velocity $dP/dt$ of every cycle with equilibrium $P^*$ is a combination with
positive coefficients of the velocities for some cycles of length two $A_i \rightleftharpoons A_j$ with the same equilibrium.
Therefore, the right hand side of the Kolmogorov equation for any Markov chain with equilibrium P* also 
allows such a decomposition. 

It is necessary to stress that the decomposition of the right hand side of the Kolmogorov equation
into a conic combination of cycles of length 2 depends on the ordering of the ratios $p_i/p^*_i$ and cannot be
performed for all values of $P$ simultaneously. □
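The vertex-deletion step of this proof can be sketched in code. The example below assumes a simple cycle with the constants $1/p^*_i$ and computes, at a given distribution $P$, the two-cycles $A_i \rightleftharpoons A_j$ and coefficients $\kappa$ whose velocities sum to the velocity of the cycle. The function names and test data are hypothetical, and the tie handling (choosing a minimal-flux vertex whose successor has a strictly larger flux) is an added assumption that avoids a zero denominator; it is not discussed in the text.

```python
import numpy as np

def cycle_velocity(cycle, p, p_star):
    """dP/dt for the simple cycle cycle[0] -> cycle[1] -> ... -> cycle[0] with constants 1/p*_i."""
    v = np.zeros(len(p))
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        flux = p[a] / p_star[a]
        v[a] -= flux
        v[b] += flux
    return v

def decompose_cycle(cycle, p, p_star):
    """Return [(i, j, kappa)]: the cycle velocity at P equals sum of kappa * velocity of A_i <-> A_j."""
    cycle = list(cycle)
    r = np.asarray(p, dtype=float) / np.asarray(p_star, dtype=float)   # fluxes p_i / p*_i
    terms = []
    while len(cycle) > 2:
        m = len(cycle)
        r_min = min(r[a] for a in cycle)
        # Assumption: among minimal-flux vertices take one followed by a strictly larger flux.
        candidates = [k for k in range(m)
                      if r[cycle[k]] == r_min and r[cycle[(k + 1) % m]] > r_min]
        if not candidates:           # all fluxes are equal: the remaining velocity is zero
            return terms
        k = candidates[0]
        prev, cur, nxt = cycle[k - 1], cycle[k], cycle[(k + 1) % m]
        kappa = (r[prev] - r[cur]) / (r[nxt] - r[cur])
        terms.append((cur, nxt, kappa))
        del cycle[k]                 # delete the vertex with the smallest outgoing flux
    terms.append((cycle[0], cycle[1], 1.0))   # the last cycle already has length two
    return terms

# Hypothetical example data.
p_star = np.array([0.1, 0.2, 0.3, 0.4])
p = np.array([0.4, 0.3, 0.2, 0.1])
cycle = [0, 1, 2, 3]

terms = decompose_cycle(cycle, p, p_star)
v_n = cycle_velocity(cycle, p, p_star)
rebuilt = sum(k * cycle_velocity([i, j], p, p_star) for i, j, k in terms)
print(terms)
print(np.allclose(rebuilt, v_n))
```

Running the same function at a distribution with a different ordering of the ratios $p_i/p^*_i$ generally returns a different set of two-cycles, which is the dependence on $P$ stressed in the proof.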

For more details and further references see [27].